In the previous session, we learned how to import .csv files into Pandas and perform the first exploratory checks on a dataset. We worked with the structure of a single table and practiced reading rows, understanding columns, and interpreting the basic shape of a DataFrame.
In this session, we move from single-table exploration to multi-table analytical thinking.
Our goal is to understand the Instacart dataset table by table, identify what each table represents, ask meaningful analytical questions, and then connect the tables using pd.merge().
This reflects real analytical work. In practice, data is often stored in several related tables rather than in one large spreadsheet. An analyst must first understand each table independently, then combine them carefully in order to answer richer business questions.
For this session, we work with the following files:
orders.csv
order_products_train.csv
products.csv
departments_t.csv
aisles.csv
Important
Here you can find the Instacart Market Basket Analysis dataset and its related documentation.
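Assuming the CSV files have been downloaded locally, loading them could look like the following sketch (the `DATA_DIR` path and the `load_tables` helper are illustrative assumptions, not part of the dataset documentation):

```python
import pandas as pd

# Hypothetical local folder; adjust to wherever you saved the files.
DATA_DIR = "data/"
FILES = ["orders.csv", "order_products_train.csv",
         "products.csv", "departments_t.csv", "aisles.csv"]

def load_tables(data_dir=DATA_DIR, files=FILES):
    """Read each CSV into a DataFrame, keyed by its base file name."""
    return {f.removesuffix(".csv"): pd.read_csv(data_dir + f) for f in files}

# Usage (assumes the files exist on disk):
# tables = load_tables()
# df_orders = tables["orders"]
```

Keeping the frames in a dictionary is one convenient way to explore them table by table before merging.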
Before performing any analysis, we need to understand what each table represents.
Overview of the Tables
| Table | Analytical Role | Grain | Main Key |
|---|---|---|---|
| orders | Behavioral table | One row per order | order_id |
| order_products_train | Transaction bridge table | One row per product within an order | no single unique key |
| products | Product master table | One row per product | product_id |
| departments_t | Department lookup table | One row per department | department_id |
| aisles | Aisle lookup table | One row per aisle | aisle_id |
The orders Table
The orders table is the behavioral table of the dataset.
Each row represents one order placed by one user.
```python
df_orders.head()
```
```
   order_id  user_id eval_set  order_number  order_dow  order_hour_of_day  days_since_prior_order
0   2539329        1    prior             1          2                  8                     NaN
1   2398795        1    prior             2          3                  7                    15.0
2    473747        1    prior             3          3                 12                    21.0
3   2254736        1    prior             4          4                  7                    29.0
4    431534        1    prior             5          4                 15                    28.0
```
Important columns:
order_id: unique order identifier
user_id: identifier of the customer
eval_set: indicates whether the order belongs to prior, train, or test (we are going to use train)
order_number: sequence number of the order for that user
order_dow: day of week of the order
order_hour_of_day: hour of day of the order
days_since_prior_order: number of days since the customer’s previous order
Analytical questions that can be answered from this table alone:
How many total orders are there?
How many users are represented?
What is the distribution of eval_set?
On which days do customers place orders most frequently?
At what hours do customers place orders most frequently?
Are missing values in days_since_prior_order meaningful?
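As a sketch, these questions map to short Pandas expressions like the following (the tiny `df_orders` here is a toy stand-in for the real ~3.4M-row table):

```python
import pandas as pd

# Toy sample shaped like orders.csv, for illustration only.
df_orders = pd.DataFrame({
    "order_id": [2539329, 2398795, 473747],
    "user_id": [1, 1, 2],
    "eval_set": ["prior", "train", "prior"],
    "order_dow": [2, 3, 3],
    "order_hour_of_day": [8, 7, 12],
    "days_since_prior_order": [None, 15.0, 21.0],
})

n_orders = len(df_orders)                           # total orders
n_users = df_orders["user_id"].nunique()            # distinct customers
eval_counts = df_orders["eval_set"].value_counts()  # prior / train / test split
busiest_dow = df_orders["order_dow"].mode()[0]      # most common order day
# A missing days_since_prior_order marks a user's first recorded order:
first_orders = df_orders["days_since_prior_order"].isna().sum()
```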
The order_products_train Table
This is the transaction bridge table for train orders.
Each row represents one product inside one order.
Important columns:
order_id: identifier of the order
product_id: identifier of the product
add_to_cart_order: the sequence in which the product was added to the basket
reordered: indicates whether the item had been ordered before by the same customer
Important
order_id is not unique in this table
product_id is not unique in this table
repeated values are expected because one order can contain many products, and one product can appear in many orders
Analytical questions that can be answered from this table alone:
How many line items are there?
How many unique orders are represented?
How many unique products are represented?
What proportion of items are reordered?
What does the distribution of add_to_cart_order look like?
The products Table
The products table is the product master table.
Each row represents one product.
Important columns:
product_id: unique identifier of the product
product_name: product name
aisle_id: identifier of the aisle
department_id: identifier of the department
Analytical questions that can be answered from this table alone:
How many products exist?
Are product IDs unique?
How many departments are represented in the product catalog?
How many aisles are represented in the product catalog?
Do some product names repeat?
The departments_t Table
This is a lookup table.
Each row represents one department.
Important columns:
department_id: unique department identifier
department: department name
Analytical questions that can be answered from this table alone:
How many departments exist?
Are department IDs unique?
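Both checks are one-liners; a minimal sketch with a toy lookup frame (the real table has 21 departments, and the variable name `df_departments` is an assumption):

```python
import pandas as pd

# Toy lookup shaped like departments_t.csv.
df_departments = pd.DataFrame({
    "department_id": [1, 2, 3],
    "department": ["frozen", "other", "bakery"],
})

n_departments = df_departments["department_id"].nunique()  # how many exist
ids_unique = df_departments["department_id"].is_unique     # key uniqueness
```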
The aisles Table
This is also a lookup table.
Each row represents one aisle.
Important columns:
aisle_id: unique aisle identifier
aisle: aisle name
Analytical questions that can be answered from this table alone:
How many aisles exist?
Are aisle IDs unique?
Relationship Between the Tables
Before merging, we should understand how the tables are connected.
```mermaid
flowchart LR
    A[orders] -->|order_id| B[order_products_train]
    B -->|product_id| C[products]
    C -->|department_id| D[departments_t]
    C -->|aisle_id| E[aisles]
```
This tells us the following:
the orders table contains order-level behavior
the order_products_train table connects orders to products
the products table adds product information
the departments_t and aisles tables make product categories more interpretable
Why We Explore Table by Table First
Before using pd.merge(), it is good analytical practice to inspect each table independently.
This allows us to:
understand the grain of the data
identify key columns
find missing values
check whether IDs are unique where they should be unique
detect possible problems before combining tables
Exploring the orders Table
We begin with the orders table because it is the most intuitive table in the Instacart dataset. It contains behavioral information about when users place orders and how frequently they shop.
Each row in this table represents one order placed by one user.
Previewing the data
```python
df_orders.head()
```
```
   order_id  user_id eval_set  order_number  order_dow  order_hour_of_day  days_since_prior_order
0   2539329        1    prior             1          2                  8                     NaN
1   2398795        1    prior             2          3                  7                    15.0
2    473747        1    prior             3          3                 12                    21.0
3   2254736        1    prior             4          4                  7                    29.0
4    431534        1    prior             5          4                 15                    28.0
```
The output helps us answer two important structural questions:
How many rows and columns does the table have?
What variables are available for analysis?
Inspecting the schema
A very important step in Pandas is checking the schema of the table. This helps us understand the data types and whether missing values are present.
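The usual schema checks can be sketched as follows (the small `df_orders` here is a toy stand-in for the real table):

```python
import pandas as pd

# Toy stand-in shaped like orders.csv.
df_orders = pd.DataFrame({
    "order_id": [1, 2],
    "eval_set": ["prior", "train"],
    "days_since_prior_order": [None, 15.0],
})

df_orders.info()                   # dtypes, non-null counts, memory usage
dtypes = df_orders.dtypes          # data type of each column
missing = df_orders.isna().sum()   # missing values per column
```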
For numeric columns, descriptive statistics provide a first summary of the data.
```python
df_orders.describe()
```
```
           order_id       user_id  order_number     order_dow  order_hour_of_day  days_since_prior_order
count  3.421083e+06  3.421083e+06  3.421083e+06  3.421083e+06       3.421083e+06            3.214874e+06
mean   1.710542e+06  1.029782e+05  1.715486e+01  2.776219e+00       1.345202e+01            1.111484e+01
std    9.875817e+05  5.953372e+04  1.773316e+01  2.046829e+00       4.226088e+00            9.206737e+00
min    1.000000e+00  1.000000e+00  1.000000e+00  0.000000e+00       0.000000e+00            0.000000e+00
25%    8.552715e+05  5.139400e+04  5.000000e+00  1.000000e+00       1.000000e+01            4.000000e+00
50%    1.710542e+06  1.026890e+05  1.100000e+01  3.000000e+00       1.300000e+01            7.000000e+00
75%    2.565812e+06  1.543850e+05  2.300000e+01  5.000000e+00       1.600000e+01            1.500000e+01
max    3.421083e+06  2.062090e+05  1.000000e+02  6.000000e+00       2.300000e+01            3.000000e+01
```
This is useful for checking the ranges of variables such as:
order_number
order_dow
order_hour_of_day
days_since_prior_order
Note that from now on we will use the pd.options.display.float_format = '{:,.2f}'.format option when reporting data, to make the values easier to read.
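A quick sketch of what this display option does:

```python
import pandas as pd

# Format all floats with thousands separators and two decimals.
pd.options.display.float_format = '{:,.2f}'.format

s = pd.Series([3421083.0, 11.114836])
print(s)  # values render as 3,421,083.00 and 11.11
```

This only changes how numbers are displayed, not the stored values themselves.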
df_orders takes a significant amount of memory, so once we finish working with the DataFrame we can delete it:

```python
del df_orders
```
Summary of the orders table
At this stage, we understand the orders table both structurally and analytically.
It is a behavioral table with one row per order
It helps us study timing and frequency of customer orders
It introduces the important eval_set variable
It provides a good example of meaningful null values
It allows us to prepare the correct subset before merging with the train basket table
Exploring the order_products_train Table
We now move to the order_products_train table. Unlike orders, this is not an order-level table. Instead, it is a transaction bridge table that connects orders to products.
Each row in this table represents one product inside one order.
Previewing the data
```python
df_orders_products_train.head()
```
```
   order_id  product_id  add_to_cart_order  reordered
0         1       49302                  1          1
1         1       11109                  2          1
2         1       10246                  3          0
3         1       49683                  4          0
4         1       43633                  5          1
```
The output helps us answer two important structural questions:
What does one row represent in this table?
Which columns help us connect orders to products?
Inspecting the schema
A very important step in Pandas is checking the schema of the table. This helps us understand the data types and whether missing values are present.
This is an especially important table because it will later be used for merging. If key columns such as order_id or product_id have missing values, the merge process may produce incomplete results.
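A minimal sketch of this key check, using a toy frame shaped like order_products_train (the variable name `df_opt` is an assumption):

```python
import pandas as pd

# Toy line items shaped like order_products_train.csv.
df_opt = pd.DataFrame({
    "order_id": [1, 1, 2],
    "product_id": [49302, 11109, 10246],
    "add_to_cart_order": [1, 2, 1],
    "reordered": [1, 1, 0],
})

# Merge keys must be complete: missing keys never match,
# so they silently disappear from an inner join result.
key_nulls = df_opt[["order_id", "product_id"]].isna().sum()
```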
Analytical questions for order_products_train
Even without merging, the order_products_train table already allows us to answer several useful analytical questions.
How many line items are there in total?
How many unique orders are represented?
How many unique products are represented?
What is the distribution of reordered?
What does add_to_cart_order tell us about basket construction?
Are there any suspicious or inconsistent values?
Q1: How many line items exist?

```python
len(df_orders_products_train)
```
```
1384617
```

Q2: How many unique orders are represented in this table?

```python
df_orders_products_train["order_id"].nunique()
```
```
131209
```

Q3: How many unique products are represented in this table?

```python
df_orders_products_train["product_id"].nunique()
```
```
39123
```
Q4: Exploring reorder behavior
The reordered variable is analytically important because it helps us understand repeated purchase behavior.
This helps us understand how the product rows are split into first-time purchases (reordered = 0) and repeated purchases (reordered = 1).
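A sketch of this split with value_counts, using a toy reordered column:

```python
import pandas as pd

# Toy stand-in for the reordered column of order_products_train.
df_opt = pd.DataFrame({"reordered": [1, 1, 0, 0, 1]})

counts = df_opt["reordered"].value_counts()                 # absolute split
shares = df_opt["reordered"].value_counts(normalize=True)   # proportions
```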
Exploring the products Table
We now turn to the products table. This table is especially important because it provides product attributes that will later be attached to the transaction data. Missing values in aisle_id or department_id could weaken the interpretation of merged results.
Analytical questions for products
Even without merging, the products table already allows us to answer several useful analytical questions.
How many products are there in total?
Are product IDs unique?
How many aisles and departments are represented?
Do product names repeat?
Is the product catalog evenly distributed across categories?
Q1: How many unique products exist?

```python
df_products["product_id"].nunique()
```
```
49686
```

Q2: Are product IDs unique?

```python
df_products["product_id"].is_unique
```
```
False
```

Q3: How many unique departments are referenced in the product table?

```python
df_products["department_id"].nunique()
```
```
21
```

Q4: How many unique aisles are referenced in the product table?

```python
df_products["aisle_id"].nunique()
```
```
134
```

Q5: Do some product names repeat?

```python
df_products["product_name"].duplicated().sum()
```
```
np.int64(20)
```
Later, after merging with lookup tables, we will be able to interpret department and aisle information in business language instead of numeric IDs.
Understanding the role of department_id and aisle_id
The products table does not contain department names or aisle names directly. Instead, it contains numeric keys.
Exploring the aisles Table
Each column has a specific role, and its data type should match its meaning.
```python
df_aisles.dtypes
```
```
aisle_id    int64
aisle         str
dtype: object
```
Missing values
Missing values should always be checked before moving to deeper analysis.
```python
df_aisles.isna().sum()
```
```
aisle_id    0
aisle       0
dtype: int64
```
Analytical questions for aisles
Even without merging, the aisles table already allows us to answer several useful analytical questions.
How many aisles exist?
Are aisle IDs unique?
Are there any missing aisle labels?
Q1: How many aisles exist?

```python
df_aisles["aisle"].nunique()
```
```
134
```

Q2: Are aisle_id values unique?

```python
df_aisles["aisle_id"].is_unique
```
```
True
```

Q3: Are there missing values in the table?

```python
df_aisles.isna().sum()
```
```
aisle_id    0
aisle       0
dtype: int64
```
Why this table matters for merging
This is also a small table, but it is analytically important.
it translates aisle_id into a readable aisle name
it provides a more detailed grouping level than department
it will later enrich the products table and make category-level analysis more informative
Summary of the aisles table
At this stage, we understand the aisles table both structurally and analytically.
It is a lookup table with one row per aisle
It provides readable aisle names
It will later enrich the products table
Understanding pd.merge()
Until now, we explored each table independently. That was an important first step because good analytical work starts with understanding the structure, meaning, and limitations of each table before combining them.
Now we move to the next stage: merging tables.
In relational data, different pieces of information are stored in different tables.
orders tells us about order behavior
order_products_train tells us which products were included in each order
products gives product attributes
departments_t and aisles translate category IDs into readable labels
If we want to answer richer analytical questions, we need to connect these tables.
In Pandas, the main tool for this is pd.merge().
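A minimal sketch of pd.merge() with toy frames (the values are illustrative, not taken from the real files):

```python
import pandas as pd

# Toy product and department tables.
df_products = pd.DataFrame({
    "product_id": [1, 2],
    "product_name": ["Chocolate Sandwich Cookies", "All-Seasons Salt"],
    "department_id": [19, 13],
})
df_departments = pd.DataFrame({
    "department_id": [19, 13],
    "department": ["snacks", "pantry"],
})

# Left join: keep every product, attach its department name.
merged = pd.merge(df_products, df_departments,
                  on="department_id", how="left")
```

The `on` argument names the shared key, and `how` controls which rows survive the join.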
Before Merge
At this stage, we have explored all tables individually.
orders is a behavioral table with one row per order
order_products_train is a transaction bridge table with one row per product within an order
products is a master table with one row per product
departments_t is a lookup table for department names
aisles is a lookup table for aisle names
SQL and Pandas Connection
If you already know SQL, the logic of pd.merge() should feel familiar.
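As an illustration of the correspondence, a SQL LEFT JOIN and its Pandas equivalent side by side (toy frames, assumed table aliases):

```python
# SQL:  SELECT p.*, d.department
#       FROM products AS p
#       LEFT JOIN departments_t AS d
#         ON p.department_id = d.department_id
#
# Pandas equivalent:
import pandas as pd

p = pd.DataFrame({"product_id": [1], "department_id": [19]})
d = pd.DataFrame({"department_id": [19], "department": ["snacks"]})

result = p.merge(d, on="department_id", how="left")
```

`how="left"` plays the role of LEFT JOIN, `on` plays the role of the ON clause.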
The indicator=True argument of pd.merge() adds a _merge column that records, for each row, whether the key matched in both tables. This tells us whether the merge behaved as expected.
For example:
if all rows are both, then every product found a matching department
if some rows are left_only, then some products have department IDs that do not exist in departments_t
right_only rows cannot appear in a left join result, because a left join keeps only the rows of the left table; to detect departments that have no matching products, we would instead use an outer join with indicator=True
Important
The indicator argument is especially useful for data quality checks and referential integrity checks.
Investigating unmatched rows
If the merge indicator shows left_only, we should inspect those rows.
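A sketch of this inspection with toy frames, where department_id 99 deliberately has no match in the lookup:

```python
import pandas as pd

df_products = pd.DataFrame({
    "product_id": [1, 2],
    "department_id": [19, 99],  # 99 does not exist in the lookup
})
df_departments = pd.DataFrame({
    "department_id": [19],
    "department": ["snacks"],
})

checked = df_products.merge(df_departments, on="department_id",
                            how="left", indicator=True)

# Rows whose key found no department in the lookup table:
unmatched = checked[checked["_merge"] == "left_only"]
```

Inspecting `unmatched` shows exactly which keys break referential integrity.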
At this stage, we have created an enriched product table.
Each product now includes:
product_name
department
aisle
This is much better for analysis than working only with IDs.
Why validate row counts after a merge?
One very important habit is checking whether the number of rows changed unexpectedly.
If departments_t has one row per department, and each product has only one department, then merging products with departments should usually keep the same number of rows.
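This habit can be automated: pd.merge() accepts a validate argument that raises an error if the key relationship is not what we expect. A sketch with toy frames:

```python
import pandas as pd

df_products = pd.DataFrame({"product_id": [1, 2, 3],
                            "department_id": [19, 19, 13]})
df_departments = pd.DataFrame({"department_id": [19, 13],
                               "department": ["snacks", "pantry"]})

before = len(df_products)
# validate="many_to_one" raises MergeError if department_id
# is not unique on the right-hand side.
enriched = df_products.merge(df_departments, on="department_id",
                             how="left", validate="many_to_one")
assert len(enriched) == before  # a many-to-one merge must not add rows
```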
```mermaid
flowchart LR
    A[products] -->|department_id| B[products + departments]
    B -->|aisle_id| C[products_enriched]
    D[departments_t] --> B
    E[aisles] --> C
    F[orders_train] -->|order_id| G[train_transactions]
    H[order_products_train] -->|order_id| G
    G -->|product_id| I[train_analytic]
    C -->|product_id| I
```
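The merge pipeline in the diagram can be sketched end to end with toy stand-ins for the five tables (all values and frame names here are illustrative):

```python
import pandas as pd

df_orders = pd.DataFrame({"order_id": [1], "user_id": [10],
                          "eval_set": ["train"]})
df_opt = pd.DataFrame({"order_id": [1, 1], "product_id": [100, 200],
                       "reordered": [0, 1]})
df_products = pd.DataFrame({"product_id": [100, 200],
                            "product_name": ["Cookies", "Salt"],
                            "department_id": [19, 13],
                            "aisle_id": [61, 104]})
df_departments = pd.DataFrame({"department_id": [19, 13],
                               "department": ["snacks", "pantry"]})
df_aisles = pd.DataFrame({"aisle_id": [61, 104],
                          "aisle": ["cookies cakes", "spices seasonings"]})

# products -> departments_t -> aisles
products_enriched = (df_products
                     .merge(df_departments, on="department_id", how="left")
                     .merge(df_aisles, on="aisle_id", how="left"))

# train orders -> line items -> enriched products
orders_train = df_orders[df_orders["eval_set"] == "train"]
train_analytic = (orders_train
                  .merge(df_opt, on="order_id", how="inner")
                  .merge(products_enriched, on="product_id", how="left"))
```

The resulting `train_analytic` frame has one row per product per train order, with readable department and aisle labels attached.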
Homework | Solution
Step 1 | Downloading the Data
In order to build the final version of the dataframe and continue the main analytical tasks, we also need to merge the customers and states dimensions.
```python
import pandas as pd

data = ['customers.csv', 'states.csv']
URL = 'https://raw.githubusercontent.com/hovhannisyan91/aca/refs/heads/main/lab/python/data/raw/'

ldf = []
for i in data:
    df = pd.read_csv(URL + i)
    # Since we also want to combine the data, we can do it in one pass
    # by first storing each DataFrame in the list.
    ldf.append(df)
```