Session 10: DA with SQL | Advanced SQL Functions
Window Functions
A window function is a SQL function that performs a calculation over a set of rows related to the current row, called a window, without collapsing rows.
This is the defining property:
A window function adds information to each row by looking at other rows.
Formally, a window function:
- Operates over a window of rows defined by `OVER(...)`
- Preserves the original number of rows
- Attaches the computed value back to each row
This distinguishes window functions from:
- `GROUP BY` aggregates (which reduce rows)
- Subqueries and CTEs (which restructure data)
Mental model:
Each row keeps its identity, but gains context.
Why Window Functions Matter in Analytics
Analytical questions often require comparison, not just aggregation.
Examples:
- How does this order compare to the customer’s average?
- What changed since the previous transaction?
- What is the rank of this event within a group?
- How does this value evolve over time?
Window functions are designed precisely for these tasks.
Window Function Types (Layers)
- Aggregate over a window (e.g., `SUM()`, `AVG()`, `COUNT()`)
- Value from another row (e.g., `LAG()`, `LEAD()`)
- Ranking functions (e.g., `ROW_NUMBER()`, `RANK()`, `DENSE_RANK()`)
- Cumulative distributions (e.g., `CUME_DIST()`, `PERCENT_RANK()`)
- String aggregation (e.g., `STRING_AGG()`)
- Navigational functions (e.g., `FIRST_VALUE()`, `LAST_VALUE()`)
- Framing & Advanced Window Control (Advanced)
Window Syntax (Generic)
function_name(expression)
OVER (
PARTITION BY ...
ORDER BY ...
)
- `PARTITION BY` defines who is compared with whom
- `ORDER BY` defines the sequence inside the window
- The window determines which rows are visible
Create an Extended Dummy Temporary Table
Before applying window functions to the analytics schema, we use a controlled temporary table with:
- Multiple entities
- Time ordering
- Repeated values (for ranking)
- A categorical column (for `STRING_AGG`)
CREATE TEMP TABLE tmp_sales AS
SELECT *
FROM (
VALUES
(1, 'A', DATE '2024-01-01', 100, 'online'),
(2, 'A', DATE '2024-01-02', 120, 'store'),
(3, 'A', DATE '2024-01-03', 90, 'online'),
(4, 'A', DATE '2024-01-04', 130, 'store'),
(5, 'B', DATE '2024-01-01', 180, 'store'),
(6, 'B', DATE '2024-01-02', 200, 'online'),
(7, 'B', DATE '2024-01-03', 220, 'online'),
(8, 'B', DATE '2024-01-04', 200, 'store'),
(9, 'C', DATE '2024-01-01', 150, 'online'),
(10, 'C', DATE '2024-01-02', 150, 'online'),
(11, 'C', DATE '2024-01-03', 170, 'online'),
(12, 'D', DATE '2024-01-01', 90, 'store'),
(13, 'D', DATE '2024-01-02', 110, 'store'),
(14, 'E', DATE '2024-01-01', 140, 'store'),
(15, 'E', DATE '2024-01-02', 160, 'online'),
(16, 'E', DATE '2024-01-03', 155, 'store')
) AS t(
sale_id,
customer_id,
sale_date,
amount,
channel
);
Output
| sale_id | customer_id | sale_date | amount | channel |
|---|---|---|---|---|
| 1 | A | 2024-01-01 | 100 | online |
| 2 | A | 2024-01-02 | 120 | store |
| 3 | A | 2024-01-03 | 90 | online |
| 4 | A | 2024-01-04 | 130 | store |
| 5 | B | 2024-01-01 | 180 | store |
| 6 | B | 2024-01-02 | 200 | online |
| 7 | B | 2024-01-03 | 220 | online |
| 8 | B | 2024-01-04 | 200 | store |
| 9 | C | 2024-01-01 | 150 | online |
| 10 | C | 2024-01-02 | 150 | online |
| 11 | C | 2024-01-03 | 170 | online |
| 12 | D | 2024-01-01 | 90 | store |
| 13 | D | 2024-01-02 | 110 | store |
| 14 | E | 2024-01-01 | 140 | store |
| 15 | E | 2024-01-02 | 160 | online |
| 16 | E | 2024-01-03 | 155 | store |
Window Function Layer 1: Simple Aggregate Window Functions
Simple aggregate window functions apply classical aggregation logic (such as SUM, AVG, COUNT, MIN, MAX) without collapsing rows.
They answer the question:
How does this row relate to an aggregate of its group/partition?
Concept
A simple aggregate window function computes an aggregate value over a partition of rows and attaches that value to every row in the partition.
Unlike GROUP BY:
- Rows are not reduced
- Each row keeps its original granularity
- Aggregates become contextual attributes
Compute a group-level metric, then stamp it onto every row in the group.
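To make the contrast concrete, here is a sketch of both forms against the `tmp_sales` table defined above: the `GROUP BY` version returns one row per customer, while the window version returns all 16 original rows.

```sql
-- GROUP BY: collapses the 16 sales into 5 rows (one per customer)
SELECT customer_id, AVG(amount) AS avg_amount
FROM tmp_sales
GROUP BY customer_id;

-- Window function: keeps all 16 rows and stamps the
-- customer-level average onto each one
SELECT
    sale_id,
    customer_id,
    amount,
    AVG(amount) OVER (PARTITION BY customer_id) AS avg_customer_amount
FROM tmp_sales;
```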
Common Functions in This Layer
SUM()
AVG()
COUNT()
MIN()
MAX()
All of them become window functions when combined with OVER(...).
Generic Pattern
AGGREGATE_FUNCTION(column)
OVER (
PARTITION BY grouping_column
)
Example 1: Average Amount per Customer
SELECT
sale_id,
customer_id,
sale_date,
amount,
AVG(amount) OVER (PARTITION BY customer_id) AS avg_customer_amount
FROM tmp_sales;\(\downarrow\)
| sale_id | customer_id | sale_date | amount | avg_customer_amount |
|---|---|---|---|---|
| 1 | A | 2024-01-01 | 100 | 110.000 |
| 2 | A | 2024-01-02 | 120 | 110.000 |
| 3 | A | 2024-01-03 | 90 | 110.000 |
| 4 | A | 2024-01-04 | 130 | 110.000 |
| 5 | B | 2024-01-01 | 180 | 200.000 |
| 6 | B | 2024-01-02 | 200 | 200.000 |
| 7 | B | 2024-01-03 | 220 | 200.000 |
| 8 | B | 2024-01-04 | 200 | 200.000 |
| 9 | C | 2024-01-01 | 150 | 156.667 |
| 10 | C | 2024-01-02 | 150 | 156.667 |
| 11 | C | 2024-01-03 | 170 | 156.667 |
| 12 | D | 2024-01-01 | 90 | 100.000 |
| 13 | D | 2024-01-02 | 110 | 100.000 |
| 14 | E | 2024-01-01 | 140 | 151.667 |
| 15 | E | 2024-01-02 | 160 | 151.667 |
| 16 | E | 2024-01-03 | 155 | 151.667 |
\(\downarrow\)
Customer A
Amounts: 100 120 90 130
Average: (100 + 120 + 90 + 130) / 4 = 110
Each row sees ALL rows in partition:
[100, 120, 90, 130] → 110
[100, 120, 90, 130] → 110
[100, 120, 90, 130] → 110
[100, 120, 90, 130] → 110
Example 2: Total Spend per Customer
SELECT
sale_id,
customer_id,
amount,
SUM(amount) OVER (
PARTITION BY customer_id
) AS total_customer_spend
FROM tmp_sales;\(\downarrow\)
| sale_id | customer_id | amount | total_customer_spend |
|---|---|---|---|
| 1 | A | 100 | 440.000 |
| 2 | A | 120 | 440.000 |
| 3 | A | 90 | 440.000 |
| 4 | A | 130 | 440.000 |
| 5 | B | 180 | 800.000 |
| 6 | B | 200 | 800.000 |
| 7 | B | 220 | 800.000 |
| 8 | B | 200 | 800.000 |
| 9 | C | 150 | 470.000 |
| 10 | C | 150 | 470.000 |
| 11 | C | 170 | 470.000 |
| 12 | D | 90 | 200.000 |
| 13 | D | 110 | 200.000 |
| 14 | E | 140 | 455.000 |
| 15 | E | 160 | 455.000 |
| 16 | E | 155 | 455.000 |
\(\downarrow\)
Customer A
Amounts: 100 120 90 130
Total: 100 + 120 + 90 + 130 = 440
Each row sees ALL rows in partition:
[100, 120, 90, 130] → 440
[100, 120, 90, 130] → 440
[100, 120, 90, 130] → 440
[100, 120, 90, 130] → 440
Example 3: Transaction Count per Customer
SELECT
sale_id,
customer_id,
COUNT(*) OVER (
PARTITION BY customer_id
) AS transaction_count
FROM tmp_sales;\(\downarrow\)
| sale_id | customer_id | transaction_count |
|---|---|---|
| 1 | A | 4 |
| 2 | A | 4 |
| 3 | A | 4 |
| 4 | A | 4 |
| 5 | B | 4 |
| 6 | B | 4 |
| 7 | B | 4 |
| 8 | B | 4 |
| 9 | C | 3 |
| 10 | C | 3 |
| 11 | C | 3 |
| 12 | D | 2 |
| 13 | D | 2 |
| 14 | E | 3 |
| 15 | E | 3 |
| 16 | E | 3 |
\(\downarrow\)
Customer A
Transactions: 4 rows
Each row sees ALL rows in partition:
[1, 2, 3, 4] → 4
[1, 2, 3, 4] → 4
[1, 2, 3, 4] → 4
[1, 2, 3, 4] → 4
Example 4: Minimum and Maximum Amount per Customer
SELECT
sale_id,
customer_id,
amount,
MIN(amount) OVER (
PARTITION BY customer_id
) AS min_amount,
MAX(amount) OVER (
PARTITION BY customer_id
) AS max_amount
FROM tmp_sales;\(\downarrow\)
| sale_id | customer_id | amount | min_amount | max_amount |
|---|---|---|---|---|
| 1 | A | 100 | 90.000 | 130.000 |
| 2 | A | 120 | 90.000 | 130.000 |
| 3 | A | 90 | 90.000 | 130.000 |
| 4 | A | 130 | 90.000 | 130.000 |
| 5 | B | 180 | 180.000 | 220.000 |
| 6 | B | 200 | 180.000 | 220.000 |
| 7 | B | 220 | 180.000 | 220.000 |
| 8 | B | 200 | 180.000 | 220.000 |
| 9 | C | 150 | 150.000 | 170.000 |
| 10 | C | 150 | 150.000 | 170.000 |
| 11 | C | 170 | 150.000 | 170.000 |
| 12 | D | 90 | 90.000 | 110.000 |
| 13 | D | 110 | 90.000 | 110.000 |
| 14 | E | 140 | 140.000 | 160.000 |
| 15 | E | 160 | 140.000 | 160.000 |
| 16 | E | 155 | 140.000 | 160.000 |
\(\downarrow\)
Customer A
Amounts: 100 120 90 130
Min: 90
Max: 130
Each row sees ALL rows in partition:
[100, 120, 90, 130] → min=90, max=130
[100, 120, 90, 130] → min=90, max=130
[100, 120, 90, 130] → min=90, max=130
[100, 120, 90, 130] → min=90, max=130
Summary of Simple Aggregate Window Functions
- Aggregation logic is identical to `GROUP BY`
- The difference is row preservation
- These functions create reference baselines
- They are often the foundation for:
  - Variance analysis
  - Normalization
  - Z-scores
  - Behavioral comparisons
When to Use Simple Aggregate Window Functions
- Comparing a row to its group average
- Attaching group-level metrics to granular data
- Preparing features for modeling
- Avoiding joins or subqueries for simple enrichments
This layer forms the foundation for all advanced window logic.
Window Function Layer 2: Statistical Window Functions
Statistical window functions extend simple aggregates by answering distributional questions instead of basic totals or averages.
They answer the question:
Where does this row stand within the distribution of its group?
Concept
A statistical window function computes position, spread, or relative standing of a row within a partition.
Unlike simple aggregates:
- They focus on ordering and distribution
- They often require `ORDER BY`
- Results depend on relative position, not just totals
These functions are essential for exploratory data analysis, outlier detection, and ranking-based insights.
Common Functions in This Layer
PERCENTILE_CONT()
PERCENTILE_DISC()
MEDIAN() (engine-dependent; in PostgreSQL use PERCENTILE_CONT(0.5) instead)
PERCENT_RANK()
CUME_DIST()
NTILE()
Generic Pattern
STAT_FUNCTION(...)
OVER (
PARTITION BY grouping_column
ORDER BY ordering_column
)
Example 1: Median Amount per Customer
Note: `PERCENTILE_CONT` is an ordered-set aggregate (`WITHIN GROUP`), so this example uses `GROUP BY` and does collapse rows; most engines, including PostgreSQL, do not accept it with `OVER`.
SELECT
customer_id,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median_amount
FROM tmp_sales
GROUP BY customer_id;\(\downarrow\)
| customer_id | median_amount |
|---|---|
| A | 110.000 |
| B | 200.000 |
| C | 150.000 |
| D | 100.000 |
| E | 155.000 |
\(\downarrow\)
Customer A
Sorted amounts: 90 100 120 130
Median = (100 + 120) / 2 = 110
Customer C
Sorted amounts: 150 150 170
Median = 150
Example 2: Percent Rank of Each Transaction
SELECT
sale_id,
customer_id,
amount,
PERCENT_RANK() OVER (
PARTITION BY customer_id
ORDER BY amount
) AS percent_rank
FROM tmp_sales;\(\downarrow\)
| sale_id | customer_id | amount | percent_rank |
|---|---|---|---|
| 3 | A | 90 | 0.000 |
| 1 | A | 100 | 0.333 |
| 2 | A | 120 | 0.667 |
| 4 | A | 130 | 1.000 |
| 5 | B | 180 | 0.000 |
| 6 | B | 200 | 0.333 |
| 8 | B | 200 | 0.333 |
| 7 | B | 220 | 1.000 |
| 9 | C | 150 | 0.000 |
| 10 | C | 150 | 0.000 |
| 11 | C | 170 | 1.000 |
\(\downarrow\)
Customer A
Sorted: 90 100 120 130
Position formula:
(rank - 1) / (n - 1)
90 → (1-1)/3 = 0
100 → (2-1)/3 = 0.33
120 → (3-1)/3 = 0.67
130 → (4-1)/3 = 1
Example 3: Cumulative Distribution
SELECT
sale_id,
customer_id,
amount,
CUME_DIST() OVER (
PARTITION BY customer_id
ORDER BY amount
) AS cumulative_distribution
FROM tmp_sales;\(\downarrow\)
| sale_id | customer_id | amount | cumulative_distribution |
|---|---|---|---|
| 3 | A | 90 | 0.250 |
| 1 | A | 100 | 0.500 |
| 2 | A | 120 | 0.750 |
| 4 | A | 130 | 1.000 |
| 5 | B | 180 | 0.250 |
| 6 | B | 200 | 0.750 |
| 8 | B | 200 | 0.750 |
| 7 | B | 220 | 1.000 |
\(\downarrow\)
Customer A
Sorted: 90 100 120 130
CUME_DIST = (# values ≤ current) / total
90 → 1/4 = 0.25
100 → 2/4 = 0.50
120 → 3/4 = 0.75
130 → 4/4 = 1
Example 4: Quartile Assignment
SELECT
sale_id,
customer_id,
amount,
NTILE(4) OVER (
PARTITION BY customer_id
ORDER BY amount
) AS quartile
FROM tmp_sales;\(\downarrow\)
| sale_id | customer_id | amount | quartile |
|---|---|---|---|
| 3 | A | 90 | 1 |
| 1 | A | 100 | 2 |
| 2 | A | 120 | 3 |
| 4 | A | 130 | 4 |
| 5 | B | 180 | 1 |
| 6 | B | 200 | 2 |
| 8 | B | 200 | 3 |
| 7 | B | 220 | 4 |
\(\downarrow\)
Customer A
Sorted: 90 100 120 130
Split into 4 equal buckets:
90 → Q1
100 → Q2
120 → Q3
130 → Q4
Key Observations
- Statistical window functions are order-sensitive
- They expose distribution shape
- Results are relative, not absolute
- They are critical for:
  - Outlier detection
  - Segmentation
  - Threshold-based decisions
When to Use Statistical Window Functions
- Understanding customer behavior distributions
- Identifying top/bottom performers
- Creating quantile-based segments
- Normalizing values for comparison
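The normalization use case can be sketched by combining several aggregate windows in one query (assuming an engine such as PostgreSQL that provides `STDDEV`; `NULLIF` guards against division by zero):

```sql
-- Deviation and z-score of each sale relative to its customer's distribution
SELECT
    sale_id,
    customer_id,
    amount,
    amount - AVG(amount) OVER (PARTITION BY customer_id) AS deviation,
    (amount - AVG(amount) OVER (PARTITION BY customer_id))
        / NULLIF(STDDEV(amount) OVER (PARTITION BY customer_id), 0)
        AS z_score
FROM tmp_sales;
```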
This layer bridges descriptive statistics and analytical SQL, preparing the ground for ranking and time-aware functions.
Window Function Layer 3: Value-from-Another-Row Functions
Value-from-another-row window functions allow a row to access values from neighboring rows within the same partition.
They answer the question:
How does this row relate to previous or next rows in sequence?
Concept
A value-from-another-row window function retrieves a value from a different row in the same window, based on ordering.
These functions are inherently time-aware and sequence-aware.
Unlike aggregate or statistical functions:
- They do not summarize a group
- They compare a row to specific neighboring rows
- Ordering is mandatory
Common Functions in This Layer
LAG()
LEAD()
FIRST_VALUE()
LAST_VALUE()
Generic Pattern
VALUE_FUNCTION(column [, offset])
OVER (
PARTITION BY grouping_column
ORDER BY ordering_column
)
- `offset` (optional) defines how many rows forward/backward to look
- The default offset is 1
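As a sketch of the optional arguments: PostgreSQL (and most engines) also accept a third argument, a default value returned when the offset reaches outside the partition instead of NULL.

```sql
-- Look two sales back; return 0 where no such row exists
SELECT
    sale_id,
    customer_id,
    amount,
    LAG(amount, 2, 0) OVER (
        PARTITION BY customer_id
        ORDER BY sale_date
    ) AS amount_two_sales_ago
FROM tmp_sales;
```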
Example 1: Previous Transaction Amount (LAG)
Business intuition:
“How did this transaction change compared to the previous one?”
SELECT
sale_id,
customer_id,
sale_date,
amount,
LAG(amount) OVER (
PARTITION BY customer_id
ORDER BY sale_date
) AS previous_amount
FROM tmp_sales;\(\downarrow\)
| sale_id | customer_id | sale_date | amount | previous_amount |
|---|---|---|---|---|
| 1 | A | 2024-01-01 | 100 | |
| 2 | A | 2024-01-02 | 120 | 100 |
| 3 | A | 2024-01-03 | 90 | 120 |
| 4 | A | 2024-01-04 | 130 | 90 |
| 5 | B | 2024-01-01 | 180 | |
| 6 | B | 2024-01-02 | 200 | 180 |
| 7 | B | 2024-01-03 | 220 | 200 |
| 8 | B | 2024-01-04 | 200 | 220 |
| 9 | C | 2024-01-01 | 150 | |
| 10 | C | 2024-01-02 | 150 | 150 |
| 11 | C | 2024-01-03 | 170 | 150 |
| 12 | D | 2024-01-01 | 90 | |
| 13 | D | 2024-01-02 | 110 | 90 |
| 14 | E | 2024-01-01 | 140 | |
| 15 | E | 2024-01-02 | 160 | 140 |
| 16 | E | 2024-01-03 | 155 | 160 |
\(\downarrow\)
Customer A
Transactions by date: 100 120 90 130
Current row looks 1 row backward:
100 → previous = NULL
120 → previous = 100
90 → previous = 120
130 → previous = 90
Example 2: Next Transaction Amount (LEAD)
SELECT
sale_id,
customer_id,
sale_date,
amount,
LEAD(amount) OVER (
PARTITION BY customer_id
ORDER BY sale_date
) AS next_amount
FROM tmp_sales;\(\downarrow\)
| sale_id | customer_id | sale_date | amount | next_amount |
|---|---|---|---|---|
| 1 | A | 2024-01-01 | 100 | 120 |
| 2 | A | 2024-01-02 | 120 | 90 |
| 3 | A | 2024-01-03 | 90 | 130 |
| 4 | A | 2024-01-04 | 130 | |
| 5 | B | 2024-01-01 | 180 | 200 |
| 6 | B | 2024-01-02 | 200 | 220 |
| 7 | B | 2024-01-03 | 220 | 200 |
| 8 | B | 2024-01-04 | 200 | |
| 9 | C | 2024-01-01 | 150 | 150 |
| 10 | C | 2024-01-02 | 150 | 170 |
| 11 | C | 2024-01-03 | 170 | |
| 12 | D | 2024-01-01 | 90 | 110 |
| 13 | D | 2024-01-02 | 110 | |
| 14 | E | 2024-01-01 | 140 | 160 |
| 15 | E | 2024-01-02 | 160 | 155 |
| 16 | E | 2024-01-03 | 155 | |
\(\downarrow\)
Customer A
Transactions by date: 100 120 90 130
Current row looks 1 row forward:
100 → next = 120
120 → next = 90
90 → next = 130
130 → next = NULL
Example 3: Change Since Previous Transaction
SELECT
sale_id,
customer_id,
sale_date,
amount,
amount
- LAG(amount) OVER (
PARTITION BY customer_id
ORDER BY sale_date
) AS amount_change
FROM tmp_sales;\(\downarrow\)
| sale_id | customer_id | sale_date | amount | amount_change |
|---|---|---|---|---|
| 1 | A | 2024-01-01 | 100 | |
| 2 | A | 2024-01-02 | 120 | 20 |
| 3 | A | 2024-01-03 | 90 | -30 |
| 4 | A | 2024-01-04 | 130 | 40 |
| 5 | B | 2024-01-01 | 180 | |
| 6 | B | 2024-01-02 | 200 | 20 |
| 7 | B | 2024-01-03 | 220 | 20 |
| 8 | B | 2024-01-04 | 200 | -20 |
| 9 | C | 2024-01-01 | 150 | |
| 10 | C | 2024-01-02 | 150 | 0 |
| 11 | C | 2024-01-03 | 170 | 20 |
| 12 | D | 2024-01-01 | 90 | |
| 13 | D | 2024-01-02 | 110 | 20 |
| 14 | E | 2024-01-01 | 140 | |
| 15 | E | 2024-01-02 | 160 | 20 |
| 16 | E | 2024-01-03 | 155 | -5 |
\(\downarrow\)
Customer A
Transactions by date: 100 120 90 130
Formula:
current amount - previous amount
100 → NULL
120 - 100 = 20
90 - 120 = -30
130 - 90 = 40
Example 4: First Transaction Amount per Customer
SELECT
sale_id,
customer_id,
sale_date,
amount,
FIRST_VALUE(amount) OVER (
PARTITION BY customer_id
ORDER BY sale_date
) AS first_amount
FROM tmp_sales;\(\downarrow\)
| sale_id | customer_id | sale_date | amount | first_amount |
|---|---|---|---|---|
| 1 | A | 2024-01-01 | 100 | 100 |
| 2 | A | 2024-01-02 | 120 | 100 |
| 3 | A | 2024-01-03 | 90 | 100 |
| 4 | A | 2024-01-04 | 130 | 100 |
| 5 | B | 2024-01-01 | 180 | 180 |
| 6 | B | 2024-01-02 | 200 | 180 |
| 7 | B | 2024-01-03 | 220 | 180 |
| 8 | B | 2024-01-04 | 200 | 180 |
| 9 | C | 2024-01-01 | 150 | 150 |
| 10 | C | 2024-01-02 | 150 | 150 |
| 11 | C | 2024-01-03 | 170 | 150 |
| 12 | D | 2024-01-01 | 90 | 90 |
| 13 | D | 2024-01-02 | 110 | 90 |
| 14 | E | 2024-01-01 | 140 | 140 |
| 15 | E | 2024-01-02 | 160 | 140 |
| 16 | E | 2024-01-03 | 155 | 140 |
\(\downarrow\)
Customer A
Transactions by date: 100 120 90 130
First row in partition = 100
100 → 100
120 → 100
90 → 100
130 → 100
Example 5: Last Transaction Amount per Customer
SELECT
sale_id,
customer_id,
sale_date,
amount,
LAST_VALUE(amount) OVER (
PARTITION BY customer_id
ORDER BY sale_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS last_amount
FROM tmp_sales;\(\downarrow\)
| sale_id | customer_id | sale_date | amount | last_amount |
|---|---|---|---|---|
| 1 | A | 2024-01-01 | 100 | 130 |
| 2 | A | 2024-01-02 | 120 | 130 |
| 3 | A | 2024-01-03 | 90 | 130 |
| 4 | A | 2024-01-04 | 130 | 130 |
| 5 | B | 2024-01-01 | 180 | 200 |
| 6 | B | 2024-01-02 | 200 | 200 |
| 7 | B | 2024-01-03 | 220 | 200 |
| 8 | B | 2024-01-04 | 200 | 200 |
| 9 | C | 2024-01-01 | 150 | 170 |
| 10 | C | 2024-01-02 | 150 | 170 |
| 11 | C | 2024-01-03 | 170 | 170 |
| 12 | D | 2024-01-01 | 90 | 110 |
| 13 | D | 2024-01-02 | 110 | 110 |
| 14 | E | 2024-01-01 | 140 | 155 |
| 15 | E | 2024-01-02 | 160 | 155 |
| 16 | E | 2024-01-03 | 155 | 155 |
\(\downarrow\)
Customer A
Transactions by date: 100 120 90 130
Last row in partition = 130
100 → 130
120 → 130
90 → 130
130 → 130
Summary of Value-from-Another-Row Functions
- Ordering is mandatory
- These functions are directional
- They enable:
  - Trend analysis
  - Change detection
  - Time-series feature engineering
When to Use Value-from-Another-Row Functions
- Measuring growth or decline
- Detecting behavioral changes
- Comparing events over time
- Preparing lagged features for models
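For measuring growth or decline, the `LAG()` pattern extends naturally to percentage change. The sketch below also uses the `WINDOW` clause (supported by PostgreSQL) to avoid repeating the window specification, and `NULLIF` to guard against division by zero:

```sql
SELECT
    sale_id,
    customer_id,
    amount,
    ROUND(
        100.0 * (amount - LAG(amount) OVER w)
              / NULLIF(LAG(amount) OVER w, 0),
        1
    ) AS pct_change
FROM tmp_sales
WINDOW w AS (PARTITION BY customer_id ORDER BY sale_date);
```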
This layer introduces temporal reasoning, which is fundamental for advanced analytics and forecasting.
Window Function Layer 4: Ranking Functions
Ranking window functions assign ordinal positions to rows within a partition based on a specified ordering.
They answer the question:
How does this row rank relative to other rows in its group?
Concept
A ranking window function assigns a numeric rank to each row according to an ORDER BY rule inside a partition.
Unlike statistical functions:
- They produce discrete ranks, not continuous measures
- They are sensitive to ties
- They are commonly used for top-N analysis and leaderboards
Common Functions in This Layer
ROW_NUMBER()
RANK()
DENSE_RANK()
Each function handles ties differently.
Generic Pattern
RANKING_FUNCTION()
OVER (
PARTITION BY grouping_column
ORDER BY ordering_column [ASC | DESC]
)
Function Semantics
| Function | Handles Ties | Gaps in Ranking | Typical Use |
|---|---|---|---|
| `ROW_NUMBER()` | No (ties broken arbitrarily) | No | Unique ordering |
| `RANK()` | Yes | Yes | Competition ranking |
| `DENSE_RANK()` | Yes | No | Tier-based ranking |
Example 1: Sequential Ordering (ROW_NUMBER)
Business intuition:
“What is the chronological position of each transaction?”
SELECT
sale_id,
customer_id,
sale_date,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY sale_date
) AS row_number
FROM tmp_sales;\(\downarrow\)
| sale_id | customer_id | sale_date | row_number |
|---|---|---|---|
| 1 | A | 2024-01-01 | 1 |
| 2 | A | 2024-01-02 | 2 |
| 3 | A | 2024-01-03 | 3 |
| 4 | A | 2024-01-04 | 4 |
| 5 | B | 2024-01-01 | 1 |
| 6 | B | 2024-01-02 | 2 |
| 7 | B | 2024-01-03 | 3 |
| 8 | B | 2024-01-04 | 4 |
| 9 | C | 2024-01-01 | 1 |
| 10 | C | 2024-01-02 | 2 |
| 11 | C | 2024-01-03 | 3 |
| 12 | D | 2024-01-01 | 1 |
| 13 | D | 2024-01-02 | 2 |
| 14 | E | 2024-01-01 | 1 |
| 15 | E | 2024-01-02 | 2 |
| 16 | E | 2024-01-03 | 3 |
Example 2: Dense Ranking by Amount (DENSE_RANK)
SELECT
sale_id,
customer_id,
amount,
DENSE_RANK() OVER (
PARTITION BY customer_id
ORDER BY amount DESC
) AS dense_rank_amount
FROM tmp_sales;\(\downarrow\)
| sale_id | customer_id | amount | dense_rank_amount |
|---|---|---|---|
| 4 | A | 130 | 1 |
| 2 | A | 120 | 2 |
| 1 | A | 100 | 3 |
| 3 | A | 90 | 4 |
| 7 | B | 220 | 1 |
| 6 | B | 200 | 2 |
| 8 | B | 200 | 2 |
| 5 | B | 180 | 3 |
| 11 | C | 170 | 1 |
| 9 | C | 150 | 2 |
| 10 | C | 150 | 2 |
| 13 | D | 110 | 1 |
| 12 | D | 90 | 2 |
| 15 | E | 160 | 1 |
| 16 | E | 155 | 2 |
| 14 | E | 140 | 3 |
Example 3: Ranking by Amount with Gaps (RANK)
Business intuition: “What is the spending rank of each transaction?”
SELECT
sale_id,
customer_id,
amount,
RANK() OVER (
PARTITION BY customer_id
ORDER BY amount DESC
) AS rank_amount
FROM tmp_sales;\(\downarrow\)
| sale_id | customer_id | amount | rank_amount |
|---|---|---|---|
| 4 | A | 130 | 1 |
| 2 | A | 120 | 2 |
| 1 | A | 100 | 3 |
| 3 | A | 90 | 4 |
| 7 | B | 220 | 1 |
| 6 | B | 200 | 2 |
| 8 | B | 200 | 2 |
| 5 | B | 180 | 4 |
| 11 | C | 170 | 1 |
| 9 | C | 150 | 2 |
| 10 | C | 150 | 2 |
| 13 | D | 110 | 1 |
| 12 | D | 90 | 2 |
| 15 | E | 160 | 1 |
| 16 | E | 155 | 2 |
| 14 | E | 140 | 3 |
Example 4: Comparison
SELECT
sale_id,
customer_id,
amount,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY amount DESC
) AS row_number_amount,
RANK() OVER (
PARTITION BY customer_id
ORDER BY amount DESC
) AS rank_amount,
DENSE_RANK() OVER (
PARTITION BY customer_id
ORDER BY amount DESC
) AS dense_rank_amount
FROM tmp_sales
ORDER BY customer_id, amount DESC, sale_id;\(\downarrow\)
| sale_id | customer_id | amount | row_number_amount | rank_amount | dense_rank_amount |
|---|---|---|---|---|---|
| 4 | A | 130 | 1 | 1 | 1 |
| 2 | A | 120 | 2 | 2 | 2 |
| 1 | A | 100 | 3 | 3 | 3 |
| 3 | A | 90 | 4 | 4 | 4 |
| 7 | B | 220 | 1 | 1 | 1 |
| 6 | B | 200 | 2 | 2 | 2 |
| 8 | B | 200 | 3 | 2 | 2 |
| 5 | B | 180 | 4 | 4 | 3 |
| 11 | C | 170 | 1 | 1 | 1 |
| 9 | C | 150 | 2 | 2 | 2 |
| 10 | C | 150 | 3 | 2 | 2 |
| 13 | D | 110 | 1 | 1 | 1 |
| 12 | D | 90 | 2 | 2 | 2 |
| 15 | E | 160 | 1 | 1 | 1 |
| 16 | E | 155 | 2 | 2 | 2 |
| 14 | E | 140 | 3 | 3 | 3 |
Summary of Ranking Functions
- Ranking always requires `ORDER BY`
- `ROW_NUMBER()` enforces uniqueness
- `RANK()` leaves gaps after ties
- `DENSE_RANK()` preserves continuity
Practical Analytics Use Cases
- Top-N customers or transactions
- Identifying best/worst performers
- De-duplication with deterministic ordering
- Feature engineering for modeling
When to Use Ranking Functions
- Creating leaderboards
- Selecting first / top records per group
- Resolving duplicates deterministically
- Segmenting by relative performance
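The "top records per group" and de-duplication use cases typically combine `ROW_NUMBER()` with a CTE, because window functions cannot appear directly in `WHERE`; a sketch:

```sql
-- Largest transaction per customer (ties broken deterministically by sale_id)
WITH ranked AS (
    SELECT
        sale_id,
        customer_id,
        amount,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY amount DESC, sale_id
        ) AS rn
    FROM tmp_sales
)
SELECT sale_id, customer_id, amount
FROM ranked
WHERE rn = 1;
```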
Ranking functions introduce comparative structure, preparing the ground for cumulative and navigational window logic.
Window Function Layer 5: String Aggregation Window Functions
String aggregation window functions allow rows to accumulate textual values from related rows without collapsing the dataset.
They answer the question:
What sequence or collection of categorical values is associated with this row’s group?
Concept
A string aggregation window function concatenates values from multiple rows within a partition into a single textual representation, while preserving row-level granularity.
Unlike classic GROUP BY STRING_AGG:
- Rows are not reduced
- The aggregated string is attached to each row
- Ordering controls how the string is built
This makes string aggregation window functions especially useful for behavioral tracing, auditability, and sequence analysis.
Common Functions in This Layer
STRING_AGG()
When combined with OVER(...), STRING_AGG() becomes a contextual, row-preserving operation.
Generic Pattern
STRING_AGG(column, delimiter)
OVER (
PARTITION BY grouping_column
ORDER BY ordering_column
)
- `delimiter` defines how values are separated
- `ORDER BY` defines the sequence of concatenation
- The result grows as the window progresses
Example 1: Channels Used by Each Customer (Cumulative)
Business intuition: “What interaction channels has this customer used up to this point?”
SELECT
sale_id,
customer_id,
sale_date,
channel,
STRING_AGG(channel, ', ') OVER (
PARTITION BY customer_id
ORDER BY sale_date
) AS channels_used_so_far
FROM tmp_sales;\[\downarrow\]
| sale_id | customer_id | sale_date | channel | channels_used_so_far |
|---|---|---|---|---|
| 1 | A | 2024-01-01 | online | online |
| 2 | A | 2024-01-02 | store | online, store |
| 3 | A | 2024-01-03 | online | online, store, online |
| 4 | A | 2024-01-04 | store | online, store, online, store |
| 5 | B | 2024-01-01 | store | store |
| 6 | B | 2024-01-02 | online | store, online |
| 7 | B | 2024-01-03 | online | store, online, online |
| 8 | B | 2024-01-04 | store | store, online, online, store |
| 9 | C | 2024-01-01 | online | online |
| 10 | C | 2024-01-02 | online | online, online |
| 11 | C | 2024-01-03 | online | online, online, online |
| 12 | D | 2024-01-01 | store | store |
| 13 | D | 2024-01-02 | store | store, store |
| 14 | E | 2024-01-01 | store | store |
| 15 | E | 2024-01-02 | online | store, online |
| 16 | E | 2024-01-03 | store | store, online, store |
Example 2: Full Channel History per Customer
To attach the complete history to every row, use an explicit frame.
SELECT
sale_id,
customer_id,
channel,
STRING_AGG(channel, ', ') OVER (
PARTITION BY customer_id
ORDER BY sale_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS full_channel_history
FROM tmp_sales;
`ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING`:
- `UNBOUNDED PRECEDING` → start from the first row in the partition
- `UNBOUNDED FOLLOWING` → go until the last row in the partition
\[\downarrow\]
| sale_id | customer_id | channel | full_channel_history |
|---|---|---|---|
| 1 | A | online | online, store, online, store |
| 2 | A | store | online, store, online, store |
| 3 | A | online | online, store, online, store |
| 4 | A | store | online, store, online, store |
| 5 | B | store | store, online, online, store |
| 6 | B | online | store, online, online, store |
| 7 | B | online | store, online, online, store |
| 8 | B | store | store, online, online, store |
| 9 | C | online | online, online, online |
| 10 | C | online | online, online, online |
| 11 | C | online | online, online, online |
| 12 | D | store | store, store |
| 13 | D | store | store, store |
| 14 | E | store | store, online, store |
| 15 | E | online | store, online, store |
| 16 | E | store | store, online, store |
Example 3: Channel Combination Pattern 1
SELECT
customer_id,
STRING_AGG(channel, ', ' ORDER BY channel) AS channel_pattern
FROM tmp_sales
GROUP BY customer_id;\[\downarrow\]
| customer_id | channel_pattern |
|---|---|
| A | online, online, store, store |
| B | online, online, store, store |
| C | online, online, online |
| D | store, store |
| E | online, store, store |
Try removing `ORDER BY` inside `STRING_AGG` to see how the concatenation order changes.
Example 4: Channel Combination Pattern 2
SELECT
customer_id,
STRING_AGG(channel, ', ' ORDER BY channel) AS channel_pattern,
STRING_AGG(DISTINCT channel, ', ' ORDER BY channel) AS channel_pattern_distinct
FROM tmp_sales
GROUP BY customer_id;\[\downarrow\]
| customer_id | channel_pattern | channel_pattern_distinct |
|---|---|---|
| A | online, online, store, store | online, store |
| B | online, online, store, store | online, store |
| C | online, online, online | online |
| D | store, store | store |
| E | online, store, store | online, store |
Example 5: Channel Combination Pattern Grouped
WITH customer_channels AS (
SELECT
customer_id,
STRING_AGG(DISTINCT channel, ', ' ORDER BY channel) AS channel_pattern
FROM tmp_sales
GROUP BY customer_id
)
SELECT
channel_pattern,
COUNT(*) AS customer_count
FROM customer_channels
GROUP BY channel_pattern;\[\downarrow\]
| channel_pattern | customer_count |
|---|---|
| online | 1 |
| store | 1 |
| online, store | 3 |
Summary of String Aggregation Window Functions
- Ordering determines string growth
- Without a frame, aggregation is cumulative
- With a full frame, aggregation is static per group
- Strings become contextual descriptors, not summaries
Practical Analytics Use Cases
- Customer journey reconstruction
- Channel or event sequence analysis
- Audit trails and explainability
- Feature generation for downstream models
When to Use String Aggregation Window Functions
- When order matters in categorical data
- When you need traceability, not just counts
- When building interpretable analytical features
- When collapsing rows would lose information
This layer completes the transition from numeric comparison to behavioral sequencing, setting the stage for advanced window framing and analytics patterns.
Window Function Layer 6: Window Framing & Advanced Window Control
Window framing controls which rows inside the partition are visible to the window function for each row.
They answer the question:
Exactly which subset of rows should be used when computing the window value for this row?
Concept
A window frame defines a moving or fixed subset of rows relative to the current row.
While:
- `PARTITION BY` defines who belongs together
- `ORDER BY` defines sequence
The frame defines how far backward and forward the window can see.
Without an explicit frame, SQL engines apply default framing rules, which can lead to unexpected results, especially for cumulative and navigational functions.
Frame Dimensions
A frame is defined using:
- Frame type
- Frame boundaries
ROWS | RANGE BETWEEN <start> AND <end>
Frame Types
ROWS
- Frame is defined by physical row positions
- Deterministic and predictable
- Preferred for analytics
RANGE
- Frame is defined by logical value ranges
- Can include multiple rows with the same ordering value
- More subtle and engine-dependent
Common Frame Boundaries
| Boundary | Meaning |
|---|---|
UNBOUNDED PRECEDING |
From the first row in the partition |
CURRENT ROW |
The current row |
UNBOUNDED FOLLOWING |
Until the last row in the partition |
n PRECEDING |
n rows before |
n FOLLOWING |
n rows after |
Default Frame (Important)
For most databases:
`ORDER BY column` implicitly means:
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
This is not always what you want.
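The difference shows up when the `ORDER BY` column has ties. Customer B has two sales of 200; under the default `RANGE` frame those rows are peers and are included together, while `ROWS` adds them one at a time. A sketch:

```sql
SELECT
    sale_id,
    customer_id,
    amount,
    SUM(amount) OVER (
        PARTITION BY customer_id
        ORDER BY amount
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS rows_running_sum,    -- customer B: 180, 380, 580, 800
    SUM(amount) OVER (
        PARTITION BY customer_id
        ORDER BY amount
        RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS range_running_sum    -- customer B: 180, 580, 580, 800
FROM tmp_sales
WHERE customer_id = 'B';
```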
Example 1: Running Total (Explicit Frame)
SELECT
sale_id,
customer_id,
sale_date,
amount,
SUM(amount) OVER (
PARTITION BY customer_id
ORDER BY sale_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total
FROM tmp_sales;\[\downarrow\]
| sale_id | customer_id | sale_date | amount | running_total |
|---|---|---|---|---|
| 1 | A | 2024-01-01 | 100 | 100 |
| 2 | A | 2024-01-02 | 120 | 220 |
| 3 | A | 2024-01-03 | 90 | 310 |
| 4 | A | 2024-01-04 | 130 | 440 |
| 5 | B | 2024-01-01 | 180 | 180 |
| 6 | B | 2024-01-02 | 200 | 380 |
| 7 | B | 2024-01-03 | 220 | 600 |
| 8 | B | 2024-01-04 | 200 | 800 |
| 9 | C | 2024-01-01 | 150 | 150 |
| 10 | C | 2024-01-02 | 150 | 300 |
| 11 | C | 2024-01-03 | 170 | 470 |
| 12 | D | 2024-01-01 | 90 | 90 |
| 13 | D | 2024-01-02 | 110 | 200 |
| 14 | E | 2024-01-01 | 140 | 140 |
| 15 | E | 2024-01-02 | 160 | 300 |
| 16 | E | 2024-01-03 | 155 | 455 |
Example 2: Full-Partition Aggregate (Stable Value)
SELECT
sale_id,
customer_id,
amount,
AVG(amount) OVER (
PARTITION BY customer_id
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS full_partition_avg
FROM tmp_sales;\[\downarrow\]
| sale_id | customer_id | amount | full_partition_avg |
|---|---|---|---|
| 1 | A | 100 | 110.000 |
| 2 | A | 120 | 110.000 |
| 3 | A | 90 | 110.000 |
| 4 | A | 130 | 110.000 |
| 5 | B | 180 | 200.000 |
| 6 | B | 200 | 200.000 |
| 7 | B | 220 | 200.000 |
| 8 | B | 200 | 200.000 |
| 9 | C | 150 | 156.667 |
| 10 | C | 150 | 156.667 |
| 11 | C | 170 | 156.667 |
| 12 | D | 90 | 100.000 |
| 13 | D | 110 | 100.000 |
| 14 | E | 140 | 151.667 |
| 15 | E | 160 | 151.667 |
| 16 | E | 155 | 151.667 |
Example 3: Moving Window (Last 2 Transactions)
SELECT
sale_id,
customer_id,
sale_date,
amount,
AVG(amount) OVER (
PARTITION BY customer_id
ORDER BY sale_date
ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
) AS moving_avg_2
FROM tmp_sales;

Here we can use any aggregate function: `SUM()`, `MAX()`, etc.
\[\downarrow\]
| sale_id | customer_id | sale_date | amount | moving_avg_2 |
|---|---|---|---|---|
| 1 | A | 2024-01-01 | 100 | 100.000 |
| 2 | A | 2024-01-02 | 120 | 110.000 |
| 3 | A | 2024-01-03 | 90 | 105.000 |
| 4 | A | 2024-01-04 | 130 | 110.000 |
| 5 | B | 2024-01-01 | 180 | 180.000 |
| 6 | B | 2024-01-02 | 200 | 190.000 |
| 7 | B | 2024-01-03 | 220 | 210.000 |
| 8 | B | 2024-01-04 | 200 | 210.000 |
| 9 | C | 2024-01-01 | 150 | 150.000 |
| 10 | C | 2024-01-02 | 150 | 150.000 |
| 11 | C | 2024-01-03 | 170 | 160.000 |
| 12 | D | 2024-01-01 | 90 | 90.000 |
| 13 | D | 2024-01-02 | 110 | 100.000 |
| 14 | E | 2024-01-01 | 140 | 140.000 |
| 15 | E | 2024-01-02 | 160 | 150.000 |
| 16 | E | 2024-01-03 | 155 | 157.500 |
Example 4: Forward-Looking Average
SELECT
sale_id,
customer_id,
sale_date,
amount,
AVG(amount) OVER (
PARTITION BY customer_id
ORDER BY sale_date
ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING
) AS forward_avg_2
FROM tmp_sales;

\[\downarrow\]
| sale_id | customer_id | sale_date | amount | forward_avg_2 |
|---|---|---|---|---|
| 1 | A | 2024-01-01 | 100 | 110.00 |
| 2 | A | 2024-01-02 | 120 | 105.00 |
| 3 | A | 2024-01-03 | 90 | 110.00 |
| 4 | A | 2024-01-04 | 130 | 130.00 |
| 5 | B | 2024-01-01 | 180 | 190.00 |
| 6 | B | 2024-01-02 | 200 | 210.00 |
| 7 | B | 2024-01-03 | 220 | 210.00 |
| 8 | B | 2024-01-04 | 200 | 200.00 |
| 9 | C | 2024-01-01 | 150 | 150.00 |
| 10 | C | 2024-01-02 | 150 | 160.00 |
| 11 | C | 2024-01-03 | 170 | 170.00 |
| 12 | D | 2024-01-01 | 90 | 100.00 |
| 13 | D | 2024-01-02 | 110 | 110.00 |
| 14 | E | 2024-01-01 | 140 | 150.00 |
| 15 | E | 2024-01-02 | 160 | 157.50 |
| 16 | E | 2024-01-03 | 155 | 155.00 |
Example 5: Centered Moving Average (Previous + Current + Next)
SELECT
sale_id,
customer_id,
sale_date,
amount,
AVG(amount) OVER (
PARTITION BY customer_id
ORDER BY sale_date
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
) AS centered_avg_3
FROM tmp_sales;

\[\downarrow\]
| sale_id | customer_id | sale_date | amount | centered_avg_3 |
|---|---|---|---|---|
| 1 | A | 2024-01-01 | 100 | 110.00 |
| 2 | A | 2024-01-02 | 120 | 103.33 |
| 3 | A | 2024-01-03 | 90 | 113.33 |
| 4 | A | 2024-01-04 | 130 | 110.00 |
| 5 | B | 2024-01-01 | 180 | 190.00 |
| 6 | B | 2024-01-02 | 200 | 200.00 |
| 7 | B | 2024-01-03 | 220 | 206.67 |
| 8 | B | 2024-01-04 | 200 | 210.00 |
| 9 | C | 2024-01-01 | 150 | 150.00 |
| 10 | C | 2024-01-02 | 150 | 156.67 |
| 11 | C | 2024-01-03 | 170 | 160.00 |
| 12 | D | 2024-01-01 | 90 | 100.00 |
| 13 | D | 2024-01-02 | 110 | 100.00 |
| 14 | E | 2024-01-01 | 140 | 150.00 |
| 15 | E | 2024-01-02 | 160 | 151.67 |
| 16 | E | 2024-01-03 | 155 | 157.50 |
Key Observations
- Framing determines visibility, not grouping
- Many subtle bugs come from implicit frames
- Explicit frames improve:
- Correctness
- Readability
- Portability
When to Use Explicit Window Frames
- Running totals and cumulative metrics
- Moving averages and rolling windows
- Navigational functions (`FIRST_VALUE`, `LAST_VALUE`)
- Any time correctness depends on row visibility
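The navigational functions are the classic trap: with the default frame, the window ends at the current row, so `LAST_VALUE()` simply returns the current row's value. A sketch against `tmp_sales`:

```sql
SELECT
    sale_id,
    customer_id,
    amount,
    -- default frame ends at CURRENT ROW: this just echoes the current amount
    LAST_VALUE(amount) OVER (
        PARTITION BY customer_id
        ORDER BY sale_date
    ) AS last_with_default_frame,
    -- explicit frame makes the whole partition visible
    LAST_VALUE(amount) OVER (
        PARTITION BY customer_id
        ORDER BY sale_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS last_with_full_frame
FROM tmp_sales;
```

For customer A, `last_with_default_frame` gives 100, 120, 90, 130 (each row's own amount), while `last_with_full_frame` gives 130 on every row, which is what "last value" usually means analytically.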
This layer gives you full control over window behavior, completing the mental model of advanced analytical SQL.
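When several columns share the same frame, many engines (PostgreSQL among them; not every database supports this clause) let you name the window once and reuse it. This is a style sketch, not required for the assignment below:

```sql
SELECT
    sale_id,
    customer_id,
    SUM(amount) OVER w AS running_total,
    AVG(amount) OVER w AS running_avg
FROM tmp_sales
WINDOW w AS (
    PARTITION BY customer_id
    ORDER BY sale_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
);
```

Naming the window keeps the frame definition in one place, so a later fix to the frame cannot silently diverge between columns.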
In-Class Assignment: Window Functions on the Analytics Schema
Context
You are working with the analytics schema of a transactional system that includes customers, orders, products, geography, and time.
Your task is to answer analytical questions using window functions, without collapsing rows unless explicitly stated.
You are expected to rely on:
- `PARTITION BY`
- `ORDER BY`
- Appropriate window functions from the layers discussed in class
Do not use subqueries or CTEs unless explicitly requested.
Task 1: Simple Aggregate Window Functions
For each order, compute the average order value of the customer across all their orders.
- Use `AVG()` as a window function
- Preserve one row per order
Expected concepts:
- `AVG() OVER (PARTITION BY customer_id)`
- Row-level enrichment with group context
Task 2: Statistical Window Functions
For each customer, determine the percent rank of each order based on order revenue.
Steps:
- Compute order revenue as \[\text{order\_revenue} = \sum (\text{quantity} \times \text{price})\]
- Apply `PERCENT_RANK()` within each customer
Expected concepts:
- Distribution within partitions
- Relative positioning of rows
Task 3: Value-from-Another-Row Functions
For each customer, compute the difference between the current order and the previous order in time.
- Orders must be compared chronologically
- The first order per customer should return `NULL` for the difference
Expected concepts:
- `LAG()`
- Time-aware ordering
- Change detection
Task 4: Ranking Functions
Within each city, rank customers by their total lifetime spend.
Requirements:
- Customers with the same spend must receive the same rank
- No gaps should appear in ranking
Expected concepts:
- `DENSE_RANK()`
- Partitioning by geography
- Ranking based on aggregated metrics
Task 5: String Aggregation Window Functions
For each customer, construct a chronological history of product categories they have purchased.
Requirements:
- Categories must be ordered by order date
- The full history must appear on every row for that customer
Expected concepts:
- `STRING_AGG()` with window framing
- Ordered categorical sequencing
- Behavioral traceability
Task 6: Channel Behavior Pattern (Advanced)
Create a result showing:
- A channel usage pattern (e.g. `online, store`)
- The number of customers exhibiting that pattern
Rules:
- Do not sort channels alphabetically
- Let the order emerge naturally from purchase behavior
- Customers using only one channel must appear as well
Expected concepts:
- String aggregation
- Grouping by derived behavioral signatures
- Sensitivity to ordering
Submission Guidelines
- Write one query per task
- Each query must be executable independently
- Focus on correctness, not formatting
- Prefer clarity over cleverness
This assignment evaluates your ability to reason across rows, not just aggregate them.