Data Analytics Bootcamp
  • Syllabus
  • Statistical Thinking
  • SQL
  • Python
  • Tableau
  • Lab
  • Capstone
  1. Python
  2. Python
  3. Session 05: Intro to Data Visualization
  • Syllabus
  • Statistical Thinking
    • Statistics
      • Statistics Session 01: Data Layers and Bias in Data
      • Statistics Session 02: Data Types
      • Statistics Session 03: Probabilistic Distributions
      • Statistics Session 04: Probabilistic Distributions
      • Statistics Session 05: Sampling
      • Statistics Session 06: Inferential Statistics
      • Slides
        • Course Intro
        • Descriptive Stats
        • Data Types
        • Continuous Distributions
        • Discrete Distributions
        • Sampling
        • Hypothesis Testing
  • SQL
    • SQL
      • Session 01: Intro to Relational Databases
      • Session 02: Intro to PostgreSQL
      • Session 03: DA with SQL | Data Types & Constraints
      • Session 04: DA with SQL | Filtering
      • Session 05: DA with SQL | Numeric Functions
      • Session 06: DA with SQL | String Functions
      • Session 07: DA with SQL | Date Functions
      • Session 08: DA with SQL | JOINs
      • Session 09: DA with SQL | Advanced SQL
      • Session 10: DA with SQL | Advanced SQL Functions
      • Session 11: DA with SQL | UDFs, Stored Procedures
      • Session 12: DA with SQL | Advanced Aggregations
      • Session 13: DA with SQL | Final Project
      • Slides
        • Intro to Relational Databases
        • Intro to PostgreSQL
        • Basic Queries: DDL DLM
        • Filtering
        • Numeric Functions
        • String Functions
        • Date Functions
        • Normalization and JOINs
        • Temporary Tables
        • Advanced SQL Functions
        • Reporting and Analysis with SQL
        • Advanced Aggregations
  • Python
    • Python
      • Session 01: Programming for Data Analysts
      • Session 02: Python basic Syntax, Data Structures
      • Session 03: Introduction to Pandas
      • Session 04: Advanced Pandas
      • Session 05: Intro to Data Visualization
      • Session 06: Data Visualization
      • Session 07: Working with Dates
      • Session 08: Data Visualization | Plotly
      • Session 09: Customer Segmentation | RFM
      • Slides
        • Data Analyst
  • Tableau
    • Tableau
      • Tableau Session 01: Introduction to Tableau
      • Tableau Session 02: Intermediate Visual Analytics
      • Tableau Session 03: Advanced Analytics
      • Tableau Session 04: Dashboard Design & Performance
      • Slides
        • Data Analyst
        • Data Analyst
        • Data Analyst
        • Data Analyst

On this page

  • Introduction to Data Visualization
    • Why Do We Visualize Data?
    • A Good Visualization Starts with a Question
    • Matching the Question to the Visualization
    • Main Principles of Good Visualization
    • Data-Ink Ratio
    • Chartjunk
  • Poor Visualization vs Good Visualization
    • Example 1: Comparing Categories
    • Example 2: Showing a Trend
    • Example 3: Showing a Distribution
    • Example 4: Showing a Relationship
    • Example 5: Showing Proportions
    • A Simple Decision Guide
  • Common Visualization Mistakes
    • Too many categories
    • Too many colors
    • Unsorted categories
    • Misleading axes
    • Overloaded labels
    • Practical Guidelines for Good Visualization
    • A Good Workflow for Visualization
  • Introduction to Matplotlib with the Instacart DataFrame
    • Why Start with Matplotlib?
    • Matplotlib
    • First Look at the DataFrame
    • The Basic Plotting Workflow in Matplotlib
    • Example 1: Visualizing Orders by Day of Week
    • Example 2: Visualizing Orders by Hour of Day
    • Example 3: Visualizing the Most Popular Departments
    • Example 4: Improving Readability with a Horizontal Bar Chart
    • Example 5: Visualizing the Distribution of Prices
    • Example 6: Visualizing Average Price by Department
    • Example 7: Visualizing Reorder Rate by Region
    • What do we learn here?
    • Example 8: Visualizing Relationship Between Age and Income
  • A Visual Summary of the Workflow
  1. Python
  2. Python
  3. Session 05: Intro to Data Visualization

Session 05: Intro to Data Visualization

Pandas
Aggregating
Data Visualization

Introduction to Data Visualization

Data visualization is not only about making charts look attractive. It is about helping the reader understand patterns, comparisons, distributions, relationships, and trends with the least possible confusion.

A good visualization should answer a question clearly. A poor visualization may still contain the same data, but it forces the audience to work harder, notice the wrong things, or misunderstand the message.

In this section, we begin with the theory of visualization before applying it to the Instacart project.

NoteChart Chooser

Checkout the this dashbaord, which is going to help us to choose proper chart!

Why Do We Visualize Data?

Visualizations help us do the following:

  • summarize large amounts of information
  • identify patterns quickly
  • compare categories
  • detect trends over time
  • notice unusual values or outliers
  • communicate findings to others

A table is often precise, but a chart is often more effective for seeing structure.

A Good Visualization Starts with a Question

Before selecting a chart, we must ask:

  • What exactly am I trying to show?
  • Am I comparing categories?
  • Am I showing a trend over time?
  • Am I showing a distribution?
  • Am I showing a relationship between variables?
  • Am I highlighting part-to-whole structure?

The chart type must match the analytical question.

Matching the Question to the Visualization

Analytical Question Recommended Visualization
Compare categories bar chart
Show trend over time line chart
Show distribution of one variable histogram
Show relationship between two numeric variables scatter plot
Show proportions or shares bar chart, stacked bar chart
Show ranking sorted bar chart
Show matrix-like comparisons heatmap
Show exact numbers only table

Main Principles of Good Visualization

  1. Clarity over decoration
    • The main goal is communication, not visual effects.
    • A chart should reduce cognitive effort, not increase it.
  2. Show the data, not the software: A visualization is not a place to demonstrate all formatting options. Extra effects often distract from the message.
  3. Use the simplest chart that answers the question: If a bar chart communicates the message, there is no reason to use a more complicated chart.
  4. Keep comparisons easy: The reader should be able to compare values quickly.
  5. Labels, scales, and titles matter A chart without clear labeling is difficult to interpret, even if the graph itself is correct.

Data-Ink Ratio

One of the most influential ideas in visualization comes from Edward Tufte: the data-ink ratio.

The idea is simple:

\[ \text{Data-Ink Ratio} = \frac{\text{Ink used to show data}}{\text{Total ink used in the graphic}} \]

Important

A good chart should maximize the amount of visual space used to communicate the data and minimize unnecessary visual elements.

More about data-ink here

Unnecessary elements may include:

  • heavy borders
  • excessive gridlines
  • decorative backgrounds
  • 3D effects
  • shadows
  • distracting colors
  • redundant labels
Tippurpose

This does not mean charts must be plain or ugly. It means every element should have a purpose.

Chartjunk

Tufte also warned against chartjunk.

Chartjunk refers to visual elements that do not improve understanding and often make it worse.

Examples:

  • 3D bar charts
  • gradient fills
  • patterned backgrounds
  • unnecessary icons
  • overly bright colors
  • duplicate legends and labels
Tip

About chartjunk

Poor Visualization vs Good Visualization

A useful learning approach is to compare bad and good versions of the same chart.

The data can remain identical. What changes is how well the chart communicates.

Example 1: Comparing Categories

Suppose we want to compare the sales or counts of several categories.

Poor choice

A pie chart with too many categories makes comparison difficult.

Why it is poor:

  • angles are harder to compare than lengths
  • labels become crowded
  • small differences are difficult to see

Better choice

A sorted horizontal bar chart is usually better.

Why it is better:

  • category names fit more easily
  • lengths are easier to compare than angles
  • sorting reveals ranking immediately

Demonstration

Poor visualization (unsorted bars):

Better visualization (sorted horizontal bars):

Example 2: Showing a Trend

Suppose we want to show change across time.

Poor choice

A bar chart with many time periods can become crowded and visually heavy.

Better choice

A line chart is better when the main goal is to show direction and trend over time.

Why it is better:

  • continuous change is easier to read
  • peaks and declines become more obvious
  • the chart emphasizes movement rather than isolated bars

Demonstration

Poor visualization (bars for trend):

Better visualization (line for trend):

Example 3: Showing a Distribution

Suppose we want to understand how values are spread.

Poor choice

A bar chart of every unique numeric value often becomes cluttered and hides the overall shape.

Better choice

A histogram groups values into bins and helps reveal:

  • skewness
  • concentration
  • spread
  • possible outliers

Demonstration

Poor visualization (individual bars):

Better visualization (histogram):

Example 4: Showing a Relationship

Suppose we want to see how two numeric variables move together.

Poor choice

Two separate bar charts do not show the relationship directly.

Better choice

A scatter plot makes the relationship visible.

It helps reveal:

  • positive association
  • negative association
  • clusters
  • outliers

Demonstration

Poor visualization (separate charts):

Better visualization (scatter):

Example 5: Showing Proportions

Suppose we want to compare shares across groups.

Poor choice

A pie chart, especially with many slices or similar sizes, makes comparison difficult.

Better choice

A bar chart or normalized stacked bar chart is often clearer.

Demonstration

Poor visualization (many-slice pie chart):

Better visualization (bar chart for proportions):

A Simple Decision Guide

flowchart TD
    A[What do you want to show?] --> B[Compare categories]
    A --> C[Trend over time]
    A --> D[Distribution]
    A --> E[Relationship]
    A --> F[Composition]

    B --> B1[Bar chart]
    C --> C1[Line chart]
    D --> D1[Histogram]
    E --> E1[Scatter plot]
    F --> F1[Bar chart or stacked bar chart]

Common Visualization Mistakes

Too many categories

If a chart has too many categories, the message becomes hard to read.

A common solution is:

  • show top categories only
  • group the rest into Other
  • sort values meaningfully

Example

Too many colors

Color should support meaning, not distract attention.

Use color when:

  • distinguishing groups
  • highlighting one important category
  • showing intensity or value

Avoid random color usage.

Example

Unsorted categories

When categories are not sorted, comparison becomes harder.

Sorted bars usually improve readability.

Example

Misleading axes

Axis scales can distort perception.

Be careful with:

  • truncated axes in bar charts
  • inconsistent intervals
  • unclear numeric scales

Example

Overloaded labels

Too many labels can make a chart unreadable.

Keep only labels that help interpretation.

Example

Practical Guidelines for Good Visualization

When building a chart, follow these guidelines:

  • choose the chart type based on the analytical question
  • remove unnecessary decoration
  • keep titles informative and specific
  • label axes clearly
  • sort categories when ranking matters
  • use color intentionally, not randomly
  • avoid 3D effects and chartjunk
  • prefer readability over novelty

A Good Workflow for Visualization

A useful workflow is:

  • start from the analytical question
  • choose the simplest chart that answers it
  • remove unnecessary visual elements
  • check whether the message is immediately clear
  • revise the chart if the reader must work too hard to understand it

Introduction to Matplotlib with the Instacart DataFrame

Now that we have created the final instacart DataFrame, we can begin the visualization part of the course using real project data rather than artificial examples.

In this section, we introduce matplotlib, which is one of the foundational visualization libraries in Python. It gives us direct control over chart creation and helps us understand the building blocks behind many other visualization tools.

We will start with simple visualizations based on the instacart dataset and connect each chart to a specific analytical question.

TipDocumentation

Matplotlib Documentation you can find here

Why Start with Matplotlib?

matplotlib is important because:

  • it is one of the core plotting libraries in Python
  • many other libraries are built on top of it
  • it helps students understand how charts are constructed
  • it gives more control over titles, labels, axes, and figure size
  • it is useful for analytical reporting and notebook-based analysis

Even if students later use seaborn, plotly, or dashboard tools, understanding matplotlib remains valuable.

Matplotlib

Before plotting, we need to import the library.

import pandas as pd
import matplotlib.pyplot as plt

If the instacart.csv file has already been saved, we can load it as follows.

df_instacart = pd.read_parquet("../data/processed/instacart.parquet")
df_instacart.head()
order_id order_number order_dow order_hour_of_day days_since_prior_order add_to_cart_order reordered product_name prices department ... Surname Gender state Age date_joined n_dependants fam_status income region division
0 1187899 11 4 8 14.0 1 1 Soda 9.0 beverages ... Nguyen Female Alabama 31 2/17/2019 3 married 40423 South East South Central
1 1187899 11 4 8 14.0 2 1 Organic String Cheese 8.6 dairy eggs ... Nguyen Female Alabama 31 2/17/2019 3 married 40423 South East South Central
2 1187899 11 4 8 14.0 3 1 0% Greek Strained Yogurt 12.6 dairy eggs ... Nguyen Female Alabama 31 2/17/2019 3 married 40423 South East South Central
3 1187899 11 4 8 14.0 4 1 XL Pick-A-Size Paper Towel Rolls 1.0 household ... Nguyen Female Alabama 31 2/17/2019 3 married 40423 South East South Central
4 1187899 11 4 8 14.0 5 1 Milk Chocolate Almonds 6.8 snacks ... Nguyen Female Alabama 31 2/17/2019 3 married 40423 South East South Central

5 rows × 22 columns

First Look at the DataFrame

Before plotting, it is useful to remind ourselves what kind of table we are working with.

df_instacart.shape
(1384706, 22)
df_instacart.columns
Index(['order_id', 'order_number', 'order_dow', 'order_hour_of_day',
       'days_since_prior_order', 'add_to_cart_order', 'reordered',
       'product_name', 'prices', 'department', 'aisle', 'First Name',
       'Surname', 'Gender', 'state', 'Age', 'date_joined', 'n_dependants',
       'fam_status', 'income', 'region', 'division'],
      dtype='str')
df_instacart.head()
order_id order_number order_dow order_hour_of_day days_since_prior_order add_to_cart_order reordered product_name prices department ... Surname Gender state Age date_joined n_dependants fam_status income region division
0 1187899 11 4 8 14.0 1 1 Soda 9.0 beverages ... Nguyen Female Alabama 31 2/17/2019 3 married 40423 South East South Central
1 1187899 11 4 8 14.0 2 1 Organic String Cheese 8.6 dairy eggs ... Nguyen Female Alabama 31 2/17/2019 3 married 40423 South East South Central
2 1187899 11 4 8 14.0 3 1 0% Greek Strained Yogurt 12.6 dairy eggs ... Nguyen Female Alabama 31 2/17/2019 3 married 40423 South East South Central
3 1187899 11 4 8 14.0 4 1 XL Pick-A-Size Paper Towel Rolls 1.0 household ... Nguyen Female Alabama 31 2/17/2019 3 married 40423 South East South Central
4 1187899 11 4 8 14.0 5 1 Milk Chocolate Almonds 6.8 snacks ... Nguyen Female Alabama 31 2/17/2019 3 married 40423 South East South Central

5 rows × 22 columns

This tells us that the dataset contains:

  • order-level variables
  • basket-level variables
  • product-level variables
  • customer-level variables

That means the same DataFrame can support many kinds of visualizations.

The Basic Plotting Workflow in Matplotlib

A simple plotting workflow usually includes:

  • selecting or preparing the data
  • creating a figure
  • choosing a chart type
  • adding a title
  • labeling the axes
  • displaying the chart

A very simple structure looks like this:

plt.figure()
plt.bar(x_values, y_values)
plt.title("Chart Title")
plt.xlabel("X Label")
plt.ylabel("Y Label")
plt.show()

We will now apply this logic directly to the instacart data.

Example 1: Visualizing Orders by Day of Week

Analytical question: “On which days of the week are orders placed most frequently?”

This is a category comparison problem, so a bar chart is a reasonable first choice.

Preparing the data

orders_by_day = (
    df_instacart[["order_id", "order_dow"]]
    .drop_duplicates()
    .groupby("order_dow")
    .size()
)

orders_by_day
order_dow
0    27465
1    19672
2    16119
3    15687
4    15959
5    17406
6    18901
dtype: int64

We use .drop_duplicates() because the final instacart table is at the product-within-order level, and we want to count each order only once.

Creating the chart

orders_by_day = (
    df_instacart[["order_id", "order_dow"]]
    .drop_duplicates()
    .groupby("order_dow")
    .size()
)

plt.figure()
plt.bar(orders_by_day.index, orders_by_day.values)
plt.title("Number of Orders by Day of Week")
plt.xlabel("Day of Week")
plt.ylabel("Number of Orders")
plt.show()

What do we learn here?

This is our first real matplotlib chart using the project data.

It shows us:

  • how to prepare grouped data before plotting
  • how to build a bar chart
  • how to add a title and axis labels

It also reinforces an important analytical lesson: when using a line-item dataset, we must think carefully about the correct level of aggregation

Example 2: Visualizing Orders by Hour of Day

Analytical question: At what hours are orders placed most frequently?

This is also a category comparison problem, but now the categories are hours.

Preparing the data

orders_by_hour = (
    df_instacart[["order_id", "order_hour_of_day"]]
    .drop_duplicates()
    .groupby("order_hour_of_day")
    .size()
)

orders_by_hour.head()
order_hour_of_day
0    852
1    507
2    305
3    223
4    218
dtype: int64

Creating the chart

orders_by_hour = (
    df_instacart[["order_id", "order_hour_of_day"]]
    .drop_duplicates()
    .groupby("order_hour_of_day")
    .size()
)

plt.figure()
plt.bar(orders_by_hour.index, orders_by_hour.values)
plt.title("Number of Orders by Hour of Day")
plt.xlabel("Hour of Day")
plt.ylabel("Number of Orders")
plt.show()

Example 3: Visualizing the Most Popular Departments

Analytical question: Which departments appear most often in the dataset?

This is again a category comparison problem, so a bar chart is appropriate.

Preparing the data

top_departments = df_instacart["department"].value_counts().head(10)
top_departments
department
produce            409087
dairy eggs         217051
snacks             118862
beverages          113962
frozen             100426
pantry              81242
bakery              48394
canned goods        46799
deli                44291
dry goods pasta     38713
Name: count, dtype: int64

Creating the chart

top_departments = df_instacart["department"].value_counts().head(10)

plt.figure()
plt.bar(top_departments.index, top_departments.values)
plt.title("Top 10 Departments by Number of Purchased Items")
plt.xlabel("Department")
plt.ylabel("Number of Purchased Items")
plt.xticks(rotation=45)
plt.show()

Why do we rotate the labels?

The department names are longer than simple numeric categories. Rotating labels improves readability.

This is an important practical lesson in matplotlib:

a chart may be correct but still difficult to read if labels are not formatted well

Example 4: Improving Readability with a Horizontal Bar Chart

The previous chart is correct, but category labels can still feel crowded.

A horizontal bar chart is often better when category names are longer.

Creating the improved chart

top_departments_sorted = df_instacart["department"].value_counts().head(10).sort_values()

plt.figure()
plt.barh(top_departments_sorted.index, top_departments_sorted.values)
plt.title("Top 10 Departments by Number of Purchased Items")
plt.xlabel("Number of Purchased Items")
plt.ylabel("Department")
plt.show()

What is better here?

This chart is often easier to read because:

  • the labels fit naturally
  • the category ranking is easier to interpret
  • the chart emphasizes comparison more clearly
Important

visualization is not only about plotting. It is also about choosing the clearest version of a chart.

Example 5: Visualizing the Distribution of Prices

Analytical question: How are product prices distributed?

This is not a category comparison problem. It is a distribution problem.

For distributions, a histogram is usually a better choice than a bar chart.

Preparing the data

df_instacart["prices"].describe()
count    1.384618e+06
mean     1.411711e+01
std      6.802253e+02
min      1.000000e+00
25%      4.300000e+00
50%      7.400000e+00
75%      1.130000e+01
max      9.999900e+04
Name: prices, dtype: float64

Creating the histogram

plt.figure()
plt.hist(df_instacart["prices"].dropna(), bins=10)
plt.title("Distribution of Product Prices")
plt.xlabel("Price")
plt.ylabel("Frequency")
plt.show()

Important

At first glance, the histogram appears to show only a single vertical bar. The chart is not wrong, but it is not yet informative.

The main reason is the presence of extreme values in the prices variable. A few unusually large observations stretch the horizontal axis so much that most regular prices become compressed into the first part of the plot.

This is a very common issue in visualization. Before interpreting a histogram, we should always ask whether the variable contains:

  • extreme outliers
  • data entry issues
  • a long right tail
  • unrealistic values compared with the rest of the distribution

There are several ways to handle this problem:

  • using z-scores, where we remove values that are too many standard deviations away from the mean
  • using quantiles, where we trim the extreme tail by keeping only values up to a chosen percentile
  • setting axis limits, where we keep all data but visually zoom into the most informative range
  • using a log scale, where we compress the scale to make very large values less dominant

In this section, we will use the second approach: quantiles.

The main idea is simple:

  • calculate the upper percentile threshold
  • keep only observations below that threshold
  • redraw the histogram on the trimmed data

This does not mean the removed values are unimportant. It simply means that for the purpose of understanding the main distribution, we want to reduce the visual influence of extreme observations.

Alternative 1: Using the 99th percentile

This is a common and practical choice. It removes only the most extreme 1% of observations.

Step 1: Calculate the upper quantile
upper_limit = df_instacart["prices"].quantile(0.99)
upper_limit
np.float64(18.1)
Step 2: Filter the data
df_prices_q = df_instacart[df_instacart["prices"] <= upper_limit]
df_prices_q["prices"].describe()
count    1.370887e+06
mean     7.668451e+00
std      4.039330e+00
min      1.000000e+00
25%      4.200000e+00
50%      7.300000e+00
75%      1.120000e+01
max      1.810000e+01
Name: prices, dtype: float64
Step 3: Plot the trimmed distribution
plt.figure()
plt.hist(df_prices_q["prices"].dropna(), bins=30)
plt.title("Distribution of Product Prices (Up to 99th Percentile)")
plt.xlabel("Price")
plt.ylabel("Frequency")
plt.show()

Alternative 2: Using the 95th percentile

This version is stricter. It removes a larger extreme tail and can make the central shape even clearer.

Step 1: Calculate the upper quantile
upper_limit_95 = df_instacart["prices"].quantile(0.95)
upper_limit_95
np.float64(14.3)
Step 2: Filter the data
df_prices_q95 = df_instacart[df_instacart["prices"] <= upper_limit_95]
df_prices_q95["prices"].describe()
count    1.316686e+06
mean     7.373599e+00
std      3.843880e+00
min      1.000000e+00
25%      4.100000e+00
50%      7.100000e+00
75%      1.060000e+01
max      1.430000e+01
Name: prices, dtype: float64
Step 3: Plot the trimmed distribution
plt.figure()
plt.hist(df_prices_q95["prices"].dropna(), bins=30)
plt.title("Distribution of Product Prices (Up to 95th Percentile)")
plt.xlabel("Price")
plt.ylabel("Frequency")
plt.show()

Alternative 3: Keeping all rows but zooming the x-axis

In this approach, we do not remove any values. Instead, we limit the visible range of the horizontal axis.

This is useful when:

  • we want to preserve all observations
  • we only want to improve readability
  • we do not want to create a filtered copy of the dataset
plt.figure()
plt.hist(df_instacart["prices"].dropna(), bins=30)
plt.xlim(0, 50)
plt.title("Distribution of Product Prices (Zoomed X-Axis)")
plt.xlabel("Price")
plt.ylabel("Frequency")
plt.show()

Alternative 4: Using z-scores

Another common approach is to identify values that are far away from the mean in terms of standard deviations.

The z-score is calculated as:

\[ z = \frac{x - \mu}{\sigma} \]

where:

  • \(x\) is the observation
  • \(\mu\) is the mean
  • \(\sigma\) is the standard deviation

A typical rule is to keep observations with absolute z-score less than 3.

mean_price = df_instacart["prices"].mean()
std_price = df_instacart["prices"].std()

z_scores = (df_instacart["prices"] - mean_price) / std_price

df_prices_z = df_instacart[z_scores.abs() < 3]

plt.figure()
plt.hist(df_prices_z["prices"].dropna(), bins=30)
plt.title("Distribution of Product Prices (Z-Score Filtered)")
plt.xlabel("Price")
plt.ylabel("Frequency")
plt.show()

Which method should we prefer?

There is no single universal answer. The right choice depends on the analytical goal.

Method When it is useful Main limitation
Quantile filtering When we want a robust and simple way to trim the extreme tail The cutoff is relative, not based on domain knowledge
Z-score filtering When the data is roughly symmetric and mean-based filtering is acceptable Can be sensitive when the distribution is highly skewed
Axis limits When we want to keep all data but improve readability Extreme values still exist in the data and may still affect interpretation
Log scale When values span multiple orders of magnitude Harder for beginners to interpret

What does this teach us?

This example introduces a second core chart type in matplotlib.

It shows:

  • not every problem should be visualized with bars
  • the chart type must match the analytical question
  • histograms are useful for understanding spread, concentration, and skewness

Example 6: Visualizing Average Price by Department

Analytical question Which departments have the highest average prices?

This requires both aggregation and visualization.

Preparing the data

avg_price_by_department = (
    df_instacart
    .groupby("department")["prices"]
    .mean()
    .sort_values(ascending=True)
)

avg_price_by_department
department
snacks              4.272277
other               7.184457
dry goods pasta     7.388252
household           7.402065
canned goods        7.530263
beverages           7.660526
babies              7.682672
deli                7.768707
international       7.799126
frozen              7.801781
bakery              7.833023
pets                7.867823
pantry              7.955316
personal care       7.989259
produce             7.997862
breakfast           8.090261
alcohol             8.126044
bulk                8.211626
missing             8.599139
meat seafood       16.202349
dairy eggs         48.606962
Name: prices, dtype: float64

Creating the chart

avg_price_by_department = (
    df_instacart
    .groupby("department")["prices"]
    .mean()
    .sort_values(ascending=True)
)

plt.figure()
plt.barh(avg_price_by_department.index, avg_price_by_department.values)
plt.title("Average Product Price by Department")
plt.xlabel("Average Price")
plt.ylabel("Department")
plt.show()

Why is this an important example?

It connects three ideas together:

  • grouping data with Pandas
  • summarizing with .mean()
  • visualizing the result with matplotlib

Example 7: Visualizing Reorder Rate by Region

Analytical question: Which regions have the highest reorder rate?

Preparing the data

reorder_rate_by_region = (
    df_instacart
    .groupby("region")["reordered"]
    .mean()
    .sort_values(ascending=True)
)

reorder_rate_by_region
region
South        0.596392
Midwest      0.598615
West         0.599062
Northeast    0.602097
Name: reordered, dtype: float64

Creating the chart

reorder_rate_by_region = (
    df_instacart
    .groupby("region")["reordered"]
    .mean()
    .sort_values(ascending=True)
)

plt.figure()
plt.barh(reorder_rate_by_region.index, reorder_rate_by_region.values)
plt.title("Reorder Rate by Region")
plt.xlabel("Reorder Rate")
plt.ylabel("Region")
plt.show()

What do we learn here?

This chart introduces an important interpretation point:

a mean of a binary variable such as reordered can be interpreted as a proportion or rate

That is a powerful idea for analytical visualization.

Example 8: Visualizing Relationship Between Age and Income

Analytical question: Is there a visible relationship between customer age and income?

For two numeric variables, a scatter plot is often the right starting point.

Preparing the data

customer_profile = df_instacart[
    ["First Name", "Surname", "Age", "income"]
].drop_duplicates()

customer_profile.head()
First Name Surname Age income
0 Linda Nguyen 31 40423
11 Norma Chapman 68 64940
42 Janet Lester 75 115242
51 Peter Villegas 39 89095
60 Anna Allison 32 88603

Creating the chart

customer_profile = df_instacart[
    ["First Name", "Surname", "Age", "income"]
].drop_duplicates()

plt.figure()
plt.scatter(customer_profile["Age"], customer_profile["income"])
plt.title("Customer Age and Income")
plt.xlabel("Age")
plt.ylabel("Income")
plt.show()

Why do we use a deduplicated customer table here?

Because the instacart DataFrame is at the product-within-order level, customer variables repeat many times.

If we want a customer-level relationship, we should remove repeated customer rows first.

This is an essential analytical lesson: the right chart depends not only on the variables, but also on the correct grain of the data

A Visual Summary of the Workflow

flowchart TD
    A[Ask an analytical question] --> B[Prepare the data with Pandas]
    B --> C[Choose the right chart type]
    C --> D[Create the chart with Matplotlib]
    D --> E[Interpret the result]