Session 10: DA with SQL | Advanced SQL Functions
Window Functions
A window function is a SQL function that performs a calculation over a set of rows related to the current row, called a window, without collapsing rows.
This is the defining property:
A window function adds information to each row by looking at other rows.
Formally, a window function:
- Operates over a window of rows defined by `OVER(...)`
- Preserves the original number of rows
- Attaches the computed value back to each row
This distinguishes window functions from:
- `GROUP BY` aggregates (which reduce rows)
- Subqueries and CTEs (which restructure data)
Mental model:
Each row keeps its identity, but gains context.
Why Window Functions Matter in Analytics
Analytical questions often require comparison, not just aggregation.
Examples:
- How does this order compare to the customer’s average?
- What changed since the previous transaction?
- What is the rank of this event within a group?
- How does this value evolve over time?
Window functions are designed precisely for these tasks.
Window Function Types (Layers)
- Aggregate over a window (e.g., `SUM()`, `AVG()`, `COUNT()`)
- Value from another row (e.g., `LAG()`, `LEAD()`)
- Ranking functions (e.g., `ROW_NUMBER()`, `RANK()`, `DENSE_RANK()`)
- Cumulative distributions (e.g., `CUME_DIST()`, `PERCENT_RANK()`)
- String aggregation (e.g., `STRING_AGG()`)
- Navigational functions (e.g., `FIRST_VALUE()`, `LAST_VALUE()`)
- Framing & Advanced Window Control (Advanced)
Window Syntax (Generic)
function_name(expression)
OVER (
PARTITION BY ...
ORDER BY ...
)
- `PARTITION BY` defines who is compared with whom
- `ORDER BY` defines the sequence inside the window
- The window determines which rows are visible
Create an Extended Dummy Temporary Table
Before applying window functions to the analytics schema, we use a controlled temporary table with:
- Multiple entities
- Time ordering
- Repeated values (for ranking)
- A categorical column (for `STRING_AGG`)
CREATE TEMP TABLE tmp_sales AS
SELECT *
FROM (
VALUES
(1, 'A', DATE '2024-01-01', 100, 'online'),
(2, 'A', DATE '2024-01-02', 120, 'store'),
(3, 'A', DATE '2024-01-03', 90, 'online'),
(4, 'A', DATE '2024-01-04', 130, 'store'),
(5, 'B', DATE '2024-01-01', 180, 'store'),
(6, 'B', DATE '2024-01-02', 200, 'online'),
(7, 'B', DATE '2024-01-03', 220, 'online'),
(8, 'B', DATE '2024-01-04', 200, 'store'),
(9, 'C', DATE '2024-01-01', 150, 'online'),
(10, 'C', DATE '2024-01-02', 150, 'online'),
(11, 'C', DATE '2024-01-03', 170, 'online'),
(12, 'D', DATE '2024-01-01', 90, 'store'),
(13, 'D', DATE '2024-01-02', 110, 'store'),
(14, 'E', DATE '2024-01-01', 140, 'store'),
(15, 'E', DATE '2024-01-02', 160, 'online'),
(16, 'E', DATE '2024-01-03', 155, 'store')
) AS t(
sale_id,
customer_id,
sale_date,
amount,
channel
);
Output
| sale_id | customer_id | sale_date | amount | channel |
|---|---|---|---|---|
| 1 | A | 2024-01-01 | 100 | online |
| 2 | A | 2024-01-02 | 120 | store |
| 3 | A | 2024-01-03 | 90 | online |
| 4 | A | 2024-01-04 | 130 | store |
| 5 | B | 2024-01-01 | 180 | store |
| 6 | B | 2024-01-02 | 200 | online |
| 7 | B | 2024-01-03 | 220 | online |
| 8 | B | 2024-01-04 | 200 | store |
| 9 | C | 2024-01-01 | 150 | online |
| 10 | C | 2024-01-02 | 150 | online |
| 11 | C | 2024-01-03 | 170 | online |
| 12 | D | 2024-01-01 | 90 | store |
| 13 | D | 2024-01-02 | 110 | store |
| 14 | E | 2024-01-01 | 140 | store |
| 15 | E | 2024-01-02 | 160 | online |
| 16 | E | 2024-01-03 | 155 | store |
Window Function Layer 1: Simple Aggregate Window Functions
Simple aggregate window functions apply classical aggregation logic (such as SUM, AVG, COUNT, MIN, MAX) without collapsing rows.
They answer the question:
How does this row relate to an aggregate of its group/partition?
Concept
A simple aggregate window function computes an aggregate value over a partition of rows and attaches that value to every row in the partition.
Unlike GROUP BY:
- Rows are not reduced
- Each row keeps its original granularity
- Aggregates become contextual attributes
Compute a group-level metric, then stamp it onto every row in the group.
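To make the contrast concrete, here is a sketch of both forms against the `tmp_sales` table defined above: the `GROUP BY` version returns one row per customer, while the window version returns all 16 original rows.

```sql
-- GROUP BY: collapses the 16 sales into 5 rows (one per customer)
SELECT customer_id, AVG(amount) AS avg_amount
FROM tmp_sales
GROUP BY customer_id;

-- Window function: keeps all 16 rows and stamps the
-- customer-level average onto each one
SELECT
    sale_id,
    customer_id,
    amount,
    AVG(amount) OVER (PARTITION BY customer_id) AS avg_customer_amount
FROM tmp_sales;
```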
Common Functions in This Layer
SUM()
AVG()
COUNT()
MIN()
MAX()
All of them become window functions when combined with OVER(...).
Generic Pattern
AGGREGATE_FUNCTION(column)
OVER (
PARTITION BY grouping_column
)
Example 1: Average Amount per Customer
SELECT
sale_id,
customer_id,
sale_date,
amount,
AVG(amount) OVER (PARTITION BY customer_id) AS avg_customer_amount
FROM tmp_sales;\(\downarrow\)
| sale_id | customer_id | sale_date | amount | avg_customer_amount |
|---|---|---|---|---|
| 1 | A | 2024-01-01 | 100 | 110.000 |
| 2 | A | 2024-01-02 | 120 | 110.000 |
| 3 | A | 2024-01-03 | 90 | 110.000 |
| 4 | A | 2024-01-04 | 130 | 110.000 |
| 5 | B | 2024-01-01 | 180 | 200.000 |
| 6 | B | 2024-01-02 | 200 | 200.000 |
| 7 | B | 2024-01-03 | 220 | 200.000 |
| 8 | B | 2024-01-04 | 200 | 200.000 |
| 9 | C | 2024-01-01 | 150 | 156.667 |
| 10 | C | 2024-01-02 | 150 | 156.667 |
| 11 | C | 2024-01-03 | 170 | 156.667 |
| 12 | D | 2024-01-01 | 90 | 100.000 |
| 13 | D | 2024-01-02 | 110 | 100.000 |
| 14 | E | 2024-01-01 | 140 | 151.667 |
| 15 | E | 2024-01-02 | 160 | 151.667 |
| 16 | E | 2024-01-03 | 155 | 151.667 |
\(\downarrow\)
Customer A
Amounts: 100 120 90 130
Average: (100 + 120 + 90 + 130) / 4 = 110
Each row sees ALL rows in partition:
[100, 120, 90, 130] → 110
[100, 120, 90, 130] → 110
[100, 120, 90, 130] → 110
[100, 120, 90, 130] → 110
Example 2: Total Spend per Customer
SELECT
sale_id,
customer_id,
amount,
SUM(amount) OVER (
PARTITION BY customer_id
) AS total_customer_spend
FROM tmp_sales;\(\downarrow\)
| sale_id | customer_id | amount | total_customer_spend |
|---|---|---|---|
| 1 | A | 100 | 440.000 |
| 2 | A | 120 | 440.000 |
| 3 | A | 90 | 440.000 |
| 4 | A | 130 | 440.000 |
| 5 | B | 180 | 800.000 |
| 6 | B | 200 | 800.000 |
| 7 | B | 220 | 800.000 |
| 8 | B | 200 | 800.000 |
| 9 | C | 150 | 470.000 |
| 10 | C | 150 | 470.000 |
| 11 | C | 170 | 470.000 |
| 12 | D | 90 | 200.000 |
| 13 | D | 110 | 200.000 |
| 14 | E | 140 | 455.000 |
| 15 | E | 160 | 455.000 |
| 16 | E | 155 | 455.000 |
\(\downarrow\)
Customer A
Amounts: 100 120 90 130
Total: 100 + 120 + 90 + 130 = 440
Each row sees ALL rows in partition:
[100, 120, 90, 130] → 440
[100, 120, 90, 130] → 440
[100, 120, 90, 130] → 440
[100, 120, 90, 130] → 440
Example 3: Transaction Count per Customer
SELECT
sale_id,
customer_id,
COUNT(*) OVER (
PARTITION BY customer_id
) AS transaction_count
FROM tmp_sales;\(\downarrow\)
| sale_id | customer_id | transaction_count |
|---|---|---|
| 1 | A | 4 |
| 2 | A | 4 |
| 3 | A | 4 |
| 4 | A | 4 |
| 5 | B | 4 |
| 6 | B | 4 |
| 7 | B | 4 |
| 8 | B | 4 |
| 9 | C | 3 |
| 10 | C | 3 |
| 11 | C | 3 |
| 12 | D | 2 |
| 13 | D | 2 |
| 14 | E | 3 |
| 15 | E | 3 |
| 16 | E | 3 |
\(\downarrow\)
Customer A
Transactions: 4 rows
Each row sees ALL rows in partition:
[1, 2, 3, 4] → 4
[1, 2, 3, 4] → 4
[1, 2, 3, 4] → 4
[1, 2, 3, 4] → 4
Example 4: Minimum and Maximum Amount per Customer
SELECT
sale_id,
customer_id,
amount,
MIN(amount) OVER (
PARTITION BY customer_id
) AS min_amount,
MAX(amount) OVER (
PARTITION BY customer_id
) AS max_amount
FROM tmp_sales;\(\downarrow\)
| sale_id | customer_id | amount | min_amount | max_amount |
|---|---|---|---|---|
| 1 | A | 100 | 90.000 | 130.000 |
| 2 | A | 120 | 90.000 | 130.000 |
| 3 | A | 90 | 90.000 | 130.000 |
| 4 | A | 130 | 90.000 | 130.000 |
| 5 | B | 180 | 180.000 | 220.000 |
| 6 | B | 200 | 180.000 | 220.000 |
| 7 | B | 220 | 180.000 | 220.000 |
| 8 | B | 200 | 180.000 | 220.000 |
| 9 | C | 150 | 150.000 | 170.000 |
| 10 | C | 150 | 150.000 | 170.000 |
| 11 | C | 170 | 150.000 | 170.000 |
| 12 | D | 90 | 90.000 | 110.000 |
| 13 | D | 110 | 90.000 | 110.000 |
| 14 | E | 140 | 140.000 | 160.000 |
| 15 | E | 160 | 140.000 | 160.000 |
| 16 | E | 155 | 140.000 | 160.000 |
\(\downarrow\)
Customer A
Amounts: 100 120 90 130
Min: 90
Max: 130
Each row sees ALL rows in partition:
[100, 120, 90, 130] → min=90, max=130
[100, 120, 90, 130] → min=90, max=130
[100, 120, 90, 130] → min=90, max=130
[100, 120, 90, 130] → min=90, max=130
Summary of Simple Aggregate Window Functions
- Aggregation logic is identical to `GROUP BY`
- The difference is row preservation
- These functions create reference baselines
- They are often the foundation for:
  - Variance analysis
  - Normalization
  - Z-scores
  - Behavioral comparisons
When to Use Simple Aggregate Window Functions
- Comparing a row to its group average
- Attaching group-level metrics to granular data
- Preparing features for modeling
- Avoiding joins or subqueries for simple enrichments
This layer forms the foundation for all advanced window logic.
Window Function Layer 2: Statistical Window Functions
Statistical window functions extend simple aggregates by answering distributional questions instead of basic totals or averages.
They answer the question:
Where does this row stand within the distribution of its group?
Concept
A statistical window function computes position, spread, or relative standing of a row within a partition.
Unlike simple aggregates:
- They focus on ordering and distribution
- They often require `ORDER BY`
- Results depend on relative position, not just totals
These functions are essential for exploratory data analysis, outlier detection, and ranking-based insights.
Common Functions in This Layer
PERCENTILE_CONT()
PERCENTILE_DISC()
MEDIAN() (engine-dependent; in PostgreSQL use PERCENTILE_CONT(0.5) instead)
PERCENT_RANK()
CUME_DIST()
NTILE()
Generic Pattern
STAT_FUNCTION(...)
OVER (
PARTITION BY grouping_column
ORDER BY ordering_column
)
Example 1: Median Amount per Customer
Note: `PERCENTILE_CONT` is an ordered-set aggregate (`WITHIN GROUP`), so this example uses `GROUP BY` and does collapse rows; most engines, including PostgreSQL, do not accept it with `OVER`.
SELECT
customer_id,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median_amount
FROM tmp_sales
GROUP BY customer_id;\(\downarrow\)
| customer_id | median_amount |
|---|---|
| A | 110.000 |
| B | 200.000 |
| C | 150.000 |
| D | 100.000 |
| E | 155.000 |
\(\downarrow\)
Customer A
Sorted amounts: 90 100 120 130
Median = (100 + 120) / 2 = 110
Customer C
Sorted amounts: 150 150 170
Median = 150
Example 2: Percent Rank of Each Transaction
SELECT
sale_id,
customer_id,
amount,
PERCENT_RANK() OVER (
PARTITION BY customer_id
ORDER BY amount
) AS percent_rank
FROM tmp_sales;\(\downarrow\)
| sale_id | customer_id | amount | percent_rank |
|---|---|---|---|
| 3 | A | 90 | 0.000 |
| 1 | A | 100 | 0.333 |
| 2 | A | 120 | 0.667 |
| 4 | A | 130 | 1.000 |
| 5 | B | 180 | 0.000 |
| 6 | B | 200 | 0.333 |
| 8 | B | 200 | 0.333 |
| 7 | B | 220 | 1.000 |
| 9 | C | 150 | 0.000 |
| 10 | C | 150 | 0.000 |
| 11 | C | 170 | 1.000 |
\(\downarrow\)
Customer A
Sorted: 90 100 120 130
Position formula:
(rank - 1) / (n - 1)
90 → (1-1)/3 = 0
100 → (2-1)/3 = 0.33
120 → (3-1)/3 = 0.67
130 → (4-1)/3 = 1
Example 3: Cumulative Distribution
SELECT
sale_id,
customer_id,
amount,
CUME_DIST() OVER (
PARTITION BY customer_id
ORDER BY amount
) AS cumulative_distribution
FROM tmp_sales;\(\downarrow\)
| sale_id | customer_id | amount | cumulative_distribution |
|---|---|---|---|
| 3 | A | 90 | 0.250 |
| 1 | A | 100 | 0.500 |
| 2 | A | 120 | 0.750 |
| 4 | A | 130 | 1.000 |
| 5 | B | 180 | 0.250 |
| 6 | B | 200 | 0.750 |
| 8 | B | 200 | 0.750 |
| 7 | B | 220 | 1.000 |
\(\downarrow\)
Customer A
Sorted: 90 100 120 130
CUME_DIST = (# values ≤ current) / total
90 → 1/4 = 0.25
100 → 2/4 = 0.50
120 → 3/4 = 0.75
130 → 4/4 = 1
Example 4: Quartile Assignment
SELECT
sale_id,
customer_id,
amount,
NTILE(4) OVER (
PARTITION BY customer_id
ORDER BY amount
) AS quartile
FROM tmp_sales;\(\downarrow\)
| sale_id | customer_id | amount | quartile |
|---|---|---|---|
| 3 | A | 90 | 1 |
| 1 | A | 100 | 2 |
| 2 | A | 120 | 3 |
| 4 | A | 130 | 4 |
| 5 | B | 180 | 1 |
| 6 | B | 200 | 2 |
| 8 | B | 200 | 3 |
| 7 | B | 220 | 4 |
\(\downarrow\)
Customer A
Sorted: 90 100 120 130
Split into 4 equal buckets:
90 → Q1
100 → Q2
120 → Q3
130 → Q4
Key Observations
- Statistical window functions are order-sensitive
- They expose distribution shape
- Results are relative, not absolute
- They are critical for:
  - Outlier detection
  - Segmentation
  - Threshold-based decisions
When to Use Statistical Window Functions
- Understanding customer behavior distributions
- Identifying top/bottom performers
- Creating quantile-based segments
- Normalizing values for comparison
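The normalization use case can be sketched by combining several aggregate windows in one query (assuming an engine such as PostgreSQL that provides `STDDEV`; `NULLIF` guards against division by zero):

```sql
-- Deviation and z-score of each sale relative to its customer's distribution
SELECT
    sale_id,
    customer_id,
    amount,
    amount - AVG(amount) OVER (PARTITION BY customer_id) AS deviation,
    (amount - AVG(amount) OVER (PARTITION BY customer_id))
        / NULLIF(STDDEV(amount) OVER (PARTITION BY customer_id), 0)
        AS z_score
FROM tmp_sales;
```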
This layer bridges descriptive statistics and analytical SQL, preparing the ground for ranking and time-aware functions.
Window Function Layer 3: Value-from-Another-Row Functions
Value-from-another-row window functions allow a row to access values from neighboring rows within the same partition.
They answer the question:
How does this row relate to previous or next rows in sequence?
Concept
A value-from-another-row window function retrieves a value from a different row in the same window, based on ordering.
These functions are inherently time-aware and sequence-aware.
Unlike aggregate or statistical functions:
- They do not summarize a group
- They compare a row to specific neighboring rows
- Ordering is mandatory
Common Functions in This Layer
LAG()
LEAD()
FIRST_VALUE()
LAST_VALUE()
Generic Pattern
VALUE_FUNCTION(column [, offset])
OVER (
PARTITION BY grouping_column
ORDER BY ordering_column
)
- `offset` (optional) defines how many rows forward/backward to look
- The default offset is 1
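As a sketch of the optional arguments: PostgreSQL (and most engines) also accept a third argument, a default value returned when the offset reaches outside the partition instead of NULL.

```sql
-- Look two sales back; return 0 where no such row exists
SELECT
    sale_id,
    customer_id,
    amount,
    LAG(amount, 2, 0) OVER (
        PARTITION BY customer_id
        ORDER BY sale_date
    ) AS amount_two_sales_ago
FROM tmp_sales;
```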
Example 1: Previous Transaction Amount (LAG)
Business intuition:
“How did this transaction change compared to the previous one?”
SELECT
sale_id,
customer_id,
sale_date,
amount,
LAG(amount) OVER (
PARTITION BY customer_id
ORDER BY sale_date
) AS previous_amount
FROM tmp_sales;\(\downarrow\)
| sale_id | customer_id | sale_date | amount | previous_amount |
|---|---|---|---|---|
| 1 | A | 2024-01-01 | 100 | |
| 2 | A | 2024-01-02 | 120 | 100 |
| 3 | A | 2024-01-03 | 90 | 120 |
| 4 | A | 2024-01-04 | 130 | 90 |
| 5 | B | 2024-01-01 | 180 | |
| 6 | B | 2024-01-02 | 200 | 180 |
| 7 | B | 2024-01-03 | 220 | 200 |
| 8 | B | 2024-01-04 | 200 | 220 |
| 9 | C | 2024-01-01 | 150 | |
| 10 | C | 2024-01-02 | 150 | 150 |
| 11 | C | 2024-01-03 | 170 | 150 |
| 12 | D | 2024-01-01 | 90 | |
| 13 | D | 2024-01-02 | 110 | 90 |
| 14 | E | 2024-01-01 | 140 | |
| 15 | E | 2024-01-02 | 160 | 140 |
| 16 | E | 2024-01-03 | 155 | 160 |
\(\downarrow\)
Customer A
Transactions by date: 100 120 90 130
Current row looks 1 row backward:
100 → previous = NULL
120 → previous = 100
90 → previous = 120
130 → previous = 90
Example 2: Next Transaction Amount (LEAD)
SELECT
sale_id,
customer_id,
sale_date,
amount,
LEAD(amount) OVER (
PARTITION BY customer_id
ORDER BY sale_date
) AS next_amount
FROM tmp_sales;\(\downarrow\)
| sale_id | customer_id | sale_date | amount | next_amount |
|---|---|---|---|---|
| 1 | A | 2024-01-01 | 100 | 120 |
| 2 | A | 2024-01-02 | 120 | 90 |
| 3 | A | 2024-01-03 | 90 | 130 |
| 4 | A | 2024-01-04 | 130 | |
| 5 | B | 2024-01-01 | 180 | 200 |
| 6 | B | 2024-01-02 | 200 | 220 |
| 7 | B | 2024-01-03 | 220 | 200 |
| 8 | B | 2024-01-04 | 200 | |
| 9 | C | 2024-01-01 | 150 | 150 |
| 10 | C | 2024-01-02 | 150 | 170 |
| 11 | C | 2024-01-03 | 170 | |
| 12 | D | 2024-01-01 | 90 | 110 |
| 13 | D | 2024-01-02 | 110 | |
| 14 | E | 2024-01-01 | 140 | 160 |
| 15 | E | 2024-01-02 | 160 | 155 |
| 16 | E | 2024-01-03 | 155 | |
\(\downarrow\)
Customer A
Transactions by date: 100 120 90 130
Current row looks 1 row forward:
100 → next = 120
120 → next = 90
90 → next = 130
130 → next = NULL
Example 3: Change Since Previous Transaction
SELECT
sale_id,
customer_id,
sale_date,
amount,
amount
- LAG(amount) OVER (
PARTITION BY customer_id
ORDER BY sale_date
) AS amount_change
FROM tmp_sales;\(\downarrow\)
| sale_id | customer_id | sale_date | amount | amount_change |
|---|---|---|---|---|
| 1 | A | 2024-01-01 | 100 | |
| 2 | A | 2024-01-02 | 120 | 20 |
| 3 | A | 2024-01-03 | 90 | -30 |
| 4 | A | 2024-01-04 | 130 | 40 |
| 5 | B | 2024-01-01 | 180 | |
| 6 | B | 2024-01-02 | 200 | 20 |
| 7 | B | 2024-01-03 | 220 | 20 |
| 8 | B | 2024-01-04 | 200 | -20 |
| 9 | C | 2024-01-01 | 150 | |
| 10 | C | 2024-01-02 | 150 | 0 |
| 11 | C | 2024-01-03 | 170 | 20 |
| 12 | D | 2024-01-01 | 90 | |
| 13 | D | 2024-01-02 | 110 | 20 |
| 14 | E | 2024-01-01 | 140 | |
| 15 | E | 2024-01-02 | 160 | 20 |
| 16 | E | 2024-01-03 | 155 | -5 |
\(\downarrow\)
Customer A
Transactions by date: 100 120 90 130
Formula:
current amount - previous amount
100 → NULL
120 - 100 = 20
90 - 120 = -30
130 - 90 = 40
Example 4: First Transaction Amount per Customer
SELECT
sale_id,
customer_id,
sale_date,
amount,
FIRST_VALUE(amount) OVER (
PARTITION BY customer_id
ORDER BY sale_date
) AS first_amount
FROM tmp_sales;\(\downarrow\)
| sale_id | customer_id | sale_date | amount | first_amount |
|---|---|---|---|---|
| 1 | A | 2024-01-01 | 100 | 100 |
| 2 | A | 2024-01-02 | 120 | 100 |
| 3 | A | 2024-01-03 | 90 | 100 |
| 4 | A | 2024-01-04 | 130 | 100 |
| 5 | B | 2024-01-01 | 180 | 180 |
| 6 | B | 2024-01-02 | 200 | 180 |
| 7 | B | 2024-01-03 | 220 | 180 |
| 8 | B | 2024-01-04 | 200 | 180 |
| 9 | C | 2024-01-01 | 150 | 150 |
| 10 | C | 2024-01-02 | 150 | 150 |
| 11 | C | 2024-01-03 | 170 | 150 |
| 12 | D | 2024-01-01 | 90 | 90 |
| 13 | D | 2024-01-02 | 110 | 90 |
| 14 | E | 2024-01-01 | 140 | 140 |
| 15 | E | 2024-01-02 | 160 | 140 |
| 16 | E | 2024-01-03 | 155 | 140 |
\(\downarrow\)
Customer A
Transactions by date: 100 120 90 130
First row in partition = 100
100 → 100
120 → 100
90 → 100
130 → 100
Example 5: Last Transaction Amount per Customer
SELECT
sale_id,
customer_id,
sale_date,
amount,
LAST_VALUE(amount) OVER (
PARTITION BY customer_id
ORDER BY sale_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS last_amount
FROM tmp_sales;\(\downarrow\)
| sale_id | customer_id | sale_date | amount | last_amount |
|---|---|---|---|---|
| 1 | A | 2024-01-01 | 100 | 130 |
| 2 | A | 2024-01-02 | 120 | 130 |
| 3 | A | 2024-01-03 | 90 | 130 |
| 4 | A | 2024-01-04 | 130 | 130 |
| 5 | B | 2024-01-01 | 180 | 200 |
| 6 | B | 2024-01-02 | 200 | 200 |
| 7 | B | 2024-01-03 | 220 | 200 |
| 8 | B | 2024-01-04 | 200 | 200 |
| 9 | C | 2024-01-01 | 150 | 170 |
| 10 | C | 2024-01-02 | 150 | 170 |
| 11 | C | 2024-01-03 | 170 | 170 |
| 12 | D | 2024-01-01 | 90 | 110 |
| 13 | D | 2024-01-02 | 110 | 110 |
| 14 | E | 2024-01-01 | 140 | 155 |
| 15 | E | 2024-01-02 | 160 | 155 |
| 16 | E | 2024-01-03 | 155 | 155 |
\(\downarrow\)
Customer A
Transactions by date: 100 120 90 130
Last row in partition = 130
100 → 130
120 → 130
90 → 130
130 → 130
Summary of Value-from-Another-Row Functions
- Ordering is mandatory
- These functions are directional
- They enable:
  - Trend analysis
  - Change detection
  - Time-series feature engineering
When to Use Value-from-Another-Row Functions
- Measuring growth or decline
- Detecting behavioral changes
- Comparing events over time
- Preparing lagged features for models
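For measuring growth or decline, the `LAG()` pattern extends naturally to percentage change. The sketch below also uses the `WINDOW` clause (supported by PostgreSQL) to avoid repeating the window specification, and `NULLIF` to guard against division by zero:

```sql
SELECT
    sale_id,
    customer_id,
    amount,
    ROUND(
        100.0 * (amount - LAG(amount) OVER w)
              / NULLIF(LAG(amount) OVER w, 0),
        1
    ) AS pct_change
FROM tmp_sales
WINDOW w AS (PARTITION BY customer_id ORDER BY sale_date);
```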
This layer introduces temporal reasoning, which is fundamental for advanced analytics and forecasting.
Window Function Layer 4: Ranking Functions
Ranking window functions assign ordinal positions to rows within a partition based on a specified ordering.
They answer the question:
How does this row rank relative to other rows in its group?
Concept
A ranking window function assigns a numeric rank to each row according to an ORDER BY rule inside a partition.
Unlike statistical functions:
- They produce discrete ranks, not continuous measures
- They are sensitive to ties
- They are commonly used for top-N analysis and leaderboards
Common Functions in This Layer
ROW_NUMBER()
RANK()
DENSE_RANK()
Each function handles ties differently.
Generic Pattern
RANKING_FUNCTION()
OVER (
PARTITION BY grouping_column
ORDER BY ordering_column [ASC | DESC]
)
Function Semantics
| Function | Handles Ties | Gaps in Ranking | Typical Use |
|---|---|---|---|
| `ROW_NUMBER()` | No (ties broken arbitrarily) | No | Unique ordering |
| `RANK()` | Yes | Yes | Competition ranking |
| `DENSE_RANK()` | Yes | No | Tier-based ranking |
Example 1: Sequential Ordering (ROW_NUMBER)
Business intuition:
“What is the chronological position of each transaction?”
SELECT
sale_id,
customer_id,
sale_date,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY sale_date
) AS row_number
FROM tmp_sales;\(\downarrow\)
| sale_id | customer_id | sale_date | row_number |
|---|---|---|---|
| 1 | A | 2024-01-01 | 1 |
| 2 | A | 2024-01-02 | 2 |
| 3 | A | 2024-01-03 | 3 |
| 4 | A | 2024-01-04 | 4 |
| 5 | B | 2024-01-01 | 1 |
| 6 | B | 2024-01-02 | 2 |
| 7 | B | 2024-01-03 | 3 |
| 8 | B | 2024-01-04 | 4 |
| 9 | C | 2024-01-01 | 1 |
| 10 | C | 2024-01-02 | 2 |
| 11 | C | 2024-01-03 | 3 |
| 12 | D | 2024-01-01 | 1 |
| 13 | D | 2024-01-02 | 2 |
| 14 | E | 2024-01-01 | 1 |
| 15 | E | 2024-01-02 | 2 |
| 16 | E | 2024-01-03 | 3 |
Example 2: Dense Ranking by Amount (DENSE_RANK)
SELECT
sale_id,
customer_id,
amount,
DENSE_RANK() OVER (
PARTITION BY customer_id
ORDER BY amount DESC
) AS dense_rank_amount
FROM tmp_sales;\(\downarrow\)
| sale_id | customer_id | amount | dense_rank_amount |
|---|---|---|---|
| 4 | A | 130 | 1 |
| 2 | A | 120 | 2 |
| 1 | A | 100 | 3 |
| 3 | A | 90 | 4 |
| 7 | B | 220 | 1 |
| 6 | B | 200 | 2 |
| 8 | B | 200 | 2 |
| 5 | B | 180 | 3 |
| 11 | C | 170 | 1 |
| 9 | C | 150 | 2 |
| 10 | C | 150 | 2 |
| 13 | D | 110 | 1 |
| 12 | D | 90 | 2 |
| 15 | E | 160 | 1 |
| 16 | E | 155 | 2 |
| 14 | E | 140 | 3 |
Example 3: Ranking by Amount with Gaps (RANK)
Business intuition: “What is the spending rank of each transaction?”
SELECT
sale_id,
customer_id,
amount,
RANK() OVER (
PARTITION BY customer_id
ORDER BY amount DESC
) AS rank_amount
FROM tmp_sales;\(\downarrow\)
| sale_id | customer_id | amount | rank_amount |
|---|---|---|---|
| 4 | A | 130 | 1 |
| 2 | A | 120 | 2 |
| 1 | A | 100 | 3 |
| 3 | A | 90 | 4 |
| 7 | B | 220 | 1 |
| 6 | B | 200 | 2 |
| 8 | B | 200 | 2 |
| 5 | B | 180 | 4 |
| 11 | C | 170 | 1 |
| 9 | C | 150 | 2 |
| 10 | C | 150 | 2 |
| 13 | D | 110 | 1 |
| 12 | D | 90 | 2 |
| 15 | E | 160 | 1 |
| 16 | E | 155 | 2 |
| 14 | E | 140 | 3 |
Example 4: Comparison
SELECT
sale_id,
customer_id,
amount,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY amount DESC
) AS row_number_amount,
RANK() OVER (
PARTITION BY customer_id
ORDER BY amount DESC
) AS rank_amount,
DENSE_RANK() OVER (
PARTITION BY customer_id
ORDER BY amount DESC
) AS dense_rank_amount
FROM tmp_sales
ORDER BY customer_id, amount DESC, sale_id;\(\downarrow\)
| sale_id | customer_id | amount | row_number_amount | rank_amount | dense_rank_amount |
|---|---|---|---|---|---|
| 4 | A | 130 | 1 | 1 | 1 |
| 2 | A | 120 | 2 | 2 | 2 |
| 1 | A | 100 | 3 | 3 | 3 |
| 3 | A | 90 | 4 | 4 | 4 |
| 7 | B | 220 | 1 | 1 | 1 |
| 6 | B | 200 | 2 | 2 | 2 |
| 8 | B | 200 | 3 | 2 | 2 |
| 5 | B | 180 | 4 | 4 | 3 |
| 11 | C | 170 | 1 | 1 | 1 |
| 9 | C | 150 | 2 | 2 | 2 |
| 10 | C | 150 | 3 | 2 | 2 |
| 13 | D | 110 | 1 | 1 | 1 |
| 12 | D | 90 | 2 | 2 | 2 |
| 15 | E | 160 | 1 | 1 | 1 |
| 16 | E | 155 | 2 | 2 | 2 |
| 14 | E | 140 | 3 | 3 | 3 |
Summary of Ranking Functions
- Ranking always requires `ORDER BY`
- `ROW_NUMBER()` enforces uniqueness
- `RANK()` leaves gaps after ties
- `DENSE_RANK()` preserves continuity
Practical Analytics Use Cases
- Top-N customers or transactions
- Identifying best/worst performers
- De-duplication with deterministic ordering
- Feature engineering for modeling
When to Use Ranking Functions
- Creating leaderboards
- Selecting first / top records per group
- Resolving duplicates deterministically
- Segmenting by relative performance
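The "top records per group" and de-duplication use cases typically combine `ROW_NUMBER()` with a CTE, because window functions cannot appear directly in `WHERE`; a sketch:

```sql
-- Largest transaction per customer (ties broken deterministically by sale_id)
WITH ranked AS (
    SELECT
        sale_id,
        customer_id,
        amount,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY amount DESC, sale_id
        ) AS rn
    FROM tmp_sales
)
SELECT sale_id, customer_id, amount
FROM ranked
WHERE rn = 1;
```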
Ranking functions introduce comparative structure, preparing the ground for cumulative and navigational window logic.
Window Function Layer 5: String Aggregation Window Functions
String aggregation window functions allow rows to accumulate textual values from related rows without collapsing the dataset.
They answer the question:
What sequence or collection of categorical values is associated with this row’s group?
Concept
A string aggregation window function concatenates values from multiple rows within a partition into a single textual representation, while preserving row-level granularity.
Unlike classic GROUP BY STRING_AGG:
- Rows are not reduced
- The aggregated string is attached to each row
- Ordering controls how the string is built
This makes string aggregation window functions especially useful for behavioral tracing, auditability, and sequence analysis.
Common Functions in This Layer
STRING_AGG()
When combined with OVER(...), STRING_AGG() becomes a contextual, row-preserving operation.
Generic Pattern
STRING_AGG(column, delimiter)
OVER (
PARTITION BY grouping_column
ORDER BY ordering_column
)
- `delimiter` defines how values are separated
- `ORDER BY` defines the sequence of concatenation
- The result grows as the window progresses
Example 1: Channels Used by Each Customer (Cumulative)
Business intuition: “What interaction channels has this customer used up to this point?”
SELECT
sale_id,
customer_id,
sale_date,
channel,
STRING_AGG(channel, ', ') OVER (
PARTITION BY customer_id
ORDER BY sale_date
) AS channels_used_so_far
FROM tmp_sales;\[\downarrow\]
| sale_id | customer_id | sale_date | channel | channels_used_so_far |
|---|---|---|---|---|
| 1 | A | 2024-01-01 | online | online |
| 2 | A | 2024-01-02 | store | online, store |
| 3 | A | 2024-01-03 | online | online, store, online |
| 4 | A | 2024-01-04 | store | online, store, online, store |
| 5 | B | 2024-01-01 | store | store |
| 6 | B | 2024-01-02 | online | store, online |
| 7 | B | 2024-01-03 | online | store, online, online |
| 8 | B | 2024-01-04 | store | store, online, online, store |
| 9 | C | 2024-01-01 | online | online |
| 10 | C | 2024-01-02 | online | online, online |
| 11 | C | 2024-01-03 | online | online, online, online |
| 12 | D | 2024-01-01 | store | store |
| 13 | D | 2024-01-02 | store | store, store |
| 14 | E | 2024-01-01 | store | store |
| 15 | E | 2024-01-02 | online | store, online |
| 16 | E | 2024-01-03 | store | store, online, store |
Example 2: Full Channel History per Customer
To attach the complete history to every row, use an explicit frame.
SELECT
sale_id,
customer_id,
channel,
STRING_AGG(channel, ', ') OVER (
PARTITION BY customer_id
ORDER BY sale_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS full_channel_history
FROM tmp_sales;
`ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING`:
- `UNBOUNDED PRECEDING` → start from the first row in the partition
- `UNBOUNDED FOLLOWING` → go until the last row in the partition
\[\downarrow\]
| sale_id | customer_id | channel | full_channel_history |
|---|---|---|---|
| 1 | A | online | online, store, online, store |
| 2 | A | store | online, store, online, store |
| 3 | A | online | online, store, online, store |
| 4 | A | store | online, store, online, store |
| 5 | B | store | store, online, online, store |
| 6 | B | online | store, online, online, store |
| 7 | B | online | store, online, online, store |
| 8 | B | store | store, online, online, store |
| 9 | C | online | online, online, online |
| 10 | C | online | online, online, online |
| 11 | C | online | online, online, online |
| 12 | D | store | store, store |
| 13 | D | store | store, store |
| 14 | E | store | store, online, store |
| 15 | E | online | store, online, store |
| 16 | E | store | store, online, store |
Example 3: Channel Combination Pattern 1
SELECT
customer_id,
STRING_AGG(channel, ', ' ORDER BY channel) AS channel_pattern
FROM tmp_sales
GROUP BY customer_id;\[\downarrow\]
| customer_id | channel_pattern |
|---|---|
| A | online, online, store, store |
| B | online, online, store, store |
| C | online, online, online |
| D | store, store |
| E | online, store, store |
Try removing `ORDER BY` inside `STRING_AGG` to see how the concatenation order changes.
Example 4: Channel Combination Pattern 2
SELECT
customer_id,
STRING_AGG(channel, ', ' ORDER BY channel) AS channel_pattern,
STRING_AGG(DISTINCT channel, ', ' ORDER BY channel) AS channel_pattern_distinct
FROM tmp_sales
GROUP BY customer_id;\[\downarrow\]
| customer_id | channel_pattern | channel_pattern_distinct |
|---|---|---|
| A | online, online, store, store | online, store |
| B | online, online, store, store | online, store |
| C | online, online, online | online |
| D | store, store | store |
| E | online, store, store | online, store |
Example 5: Channel Combination Pattern Grouped
WITH customer_channels AS (
SELECT
customer_id,
STRING_AGG(DISTINCT channel, ', ' ORDER BY channel) AS channel_pattern
FROM tmp_sales
GROUP BY customer_id
)
SELECT
channel_pattern,
COUNT(*) AS customer_count
FROM customer_channels
GROUP BY channel_pattern;\[\downarrow\]
| channel_pattern | customer_count |
|---|---|
| online | 1 |
| store | 1 |
| online, store | 3 |
Summary of String Aggregation Window Functions
- Ordering determines string growth
- Without a frame, aggregation is cumulative
- With a full frame, aggregation is static per group
- Strings become contextual descriptors, not summaries
Practical Analytics Use Cases
- Customer journey reconstruction
- Channel or event sequence analysis
- Audit trails and explainability
- Feature generation for downstream models
When to Use String Aggregation Window Functions
- When order matters in categorical data
- When you need traceability, not just counts
- When building interpretable analytical features
- When collapsing rows would lose information
This layer completes the transition from numeric comparison to behavioral sequencing, setting the stage for advanced window framing and analytics patterns.
Window Function Layer 6: Window Framing & Advanced Window Control
Window framing controls which rows inside the partition are visible to the window function for each row.
They answer the question:
Exactly which subset of rows should be used when computing the window value for this row?
Concept
A window frame defines a moving or fixed subset of rows relative to the current row.
While:
- `PARTITION BY` defines who belongs together
- `ORDER BY` defines sequence
The frame defines how far backward and forward the window can see.
Without an explicit frame, SQL engines apply default framing rules, which can lead to unexpected results, especially for cumulative and navigational functions.
Frame Dimensions
A frame is defined using:
- Frame type
- Frame boundaries
ROWS | RANGE BETWEEN <start> AND <end>
Frame Types
ROWS
- Frame is defined by physical row positions
- Deterministic and predictable
- Preferred for analytics
RANGE
- Frame is defined by logical value ranges
- Can include multiple rows with the same ordering value
- More subtle and engine-dependent
Common Frame Boundaries
| Boundary | Meaning |
|---|---|
UNBOUNDED PRECEDING |
From the first row in the partition |
CURRENT ROW |
The current row |
UNBOUNDED FOLLOWING |
Until the last row in the partition |
n PRECEDING |
n rows before |
n FOLLOWING |
n rows after |
Default Frame (Important)
For most databases:
`ORDER BY column` implicitly means:
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
This is not always what you want.
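The difference shows up when the `ORDER BY` column has ties. Customer B has two sales of 200; under the default `RANGE` frame those rows are peers and are included together, while `ROWS` adds them one at a time. A sketch:

```sql
SELECT
    sale_id,
    customer_id,
    amount,
    SUM(amount) OVER (
        PARTITION BY customer_id
        ORDER BY amount
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS rows_running_sum,    -- customer B: 180, 380, 580, 800
    SUM(amount) OVER (
        PARTITION BY customer_id
        ORDER BY amount
        RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS range_running_sum    -- customer B: 180, 580, 580, 800
FROM tmp_sales
WHERE customer_id = 'B';
```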
Example 1: Running Total (Explicit Frame)
SELECT
sale_id,
customer_id,
sale_date,
amount,
SUM(amount) OVER (
PARTITION BY customer_id
ORDER BY sale_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total
FROM tmp_sales;\[\downarrow\]
| sale_id | customer_id | sale_date | amount | running_total |
|---|---|---|---|---|
| 1 | A | 2024-01-01 | 100 | 100 |
| 2 | A | 2024-01-02 | 120 | 220 |
| 3 | A | 2024-01-03 | 90 | 310 |
| 4 | A | 2024-01-04 | 130 | 440 |
| 5 | B | 2024-01-01 | 180 | 180 |
| 6 | B | 2024-01-02 | 200 | 380 |
| 7 | B | 2024-01-03 | 220 | 600 |
| 8 | B | 2024-01-04 | 200 | 800 |
| 9 | C | 2024-01-01 | 150 | 150 |
| 10 | C | 2024-01-02 | 150 | 300 |
| 11 | C | 2024-01-03 | 170 | 470 |
| 12 | D | 2024-01-01 | 90 | 90 |
| 13 | D | 2024-01-02 | 110 | 200 |
| 14 | E | 2024-01-01 | 140 | 140 |
| 15 | E | 2024-01-02 | 160 | 300 |
| 16 | E | 2024-01-03 | 155 | 455 |
Example 2: Full-Partition Aggregate (Stable Value)
SELECT
sale_id,
customer_id,
amount,
AVG(amount) OVER (
PARTITION BY customer_id
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS full_partition_avg
FROM tmp_sales;\[\downarrow\]
| sale_id | customer_id | amount | full_partition_avg |
|---|---|---|---|
| 1 | A | 100 | 110.000 |
| 2 | A | 120 | 110.000 |
| 3 | A | 90 | 110.000 |
| 4 | A | 130 | 110.000 |
| 5 | B | 180 | 200.000 |
| 6 | B | 200 | 200.000 |
| 7 | B | 220 | 200.000 |
| 8 | B | 200 | 200.000 |
| 9 | C | 150 | 156.667 |
| 10 | C | 150 | 156.667 |
| 11 | C | 170 | 156.667 |
| 12 | D | 90 | 100.000 |
| 13 | D | 110 | 100.000 |
| 14 | E | 140 | 151.667 |
| 15 | E | 160 | 151.667 |
| 16 | E | 155 | 151.667 |
Example 3: Moving Window (Last 2 Transactions)
SELECT
sale_id,
customer_id,
sale_date,
amount,
AVG(amount) OVER (
PARTITION BY customer_id
ORDER BY sale_date
ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
) AS moving_avg_2
FROM tmp_sales;

Here we can use any aggregate function: `SUM()`, `MAX()`, etc.
\[\downarrow\]
| sale_id | customer_id | sale_date | amount | moving_avg_2 |
|---|---|---|---|---|
| 1 | A | 2024-01-01 | 100 | 100.000 |
| 2 | A | 2024-01-02 | 120 | 110.000 |
| 3 | A | 2024-01-03 | 90 | 105.000 |
| 4 | A | 2024-01-04 | 130 | 110.000 |
| 5 | B | 2024-01-01 | 180 | 180.000 |
| 6 | B | 2024-01-02 | 200 | 190.000 |
| 7 | B | 2024-01-03 | 220 | 210.000 |
| 8 | B | 2024-01-04 | 200 | 210.000 |
| 9 | C | 2024-01-01 | 150 | 150.000 |
| 10 | C | 2024-01-02 | 150 | 150.000 |
| 11 | C | 2024-01-03 | 170 | 160.000 |
| 12 | D | 2024-01-01 | 90 | 90.000 |
| 13 | D | 2024-01-02 | 110 | 100.000 |
| 14 | E | 2024-01-01 | 140 | 140.000 |
| 15 | E | 2024-01-02 | 160 | 150.000 |
| 16 | E | 2024-01-03 | 155 | 157.500 |
Example 4: Forward-Looking Average
SELECT
sale_id,
customer_id,
sale_date,
amount,
AVG(amount) OVER (
PARTITION BY customer_id
ORDER BY sale_date
ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING
) AS forward_avg_2
FROM tmp_sales;

\[\downarrow\]
| sale_id | customer_id | sale_date | amount | forward_avg_2 |
|---|---|---|---|---|
| 1 | A | 2024-01-01 | 100 | 110.00 |
| 2 | A | 2024-01-02 | 120 | 105.00 |
| 3 | A | 2024-01-03 | 90 | 110.00 |
| 4 | A | 2024-01-04 | 130 | 130.00 |
| 5 | B | 2024-01-01 | 180 | 190.00 |
| 6 | B | 2024-01-02 | 200 | 210.00 |
| 7 | B | 2024-01-03 | 220 | 210.00 |
| 8 | B | 2024-01-04 | 200 | 200.00 |
| 9 | C | 2024-01-01 | 150 | 150.00 |
| 10 | C | 2024-01-02 | 150 | 160.00 |
| 11 | C | 2024-01-03 | 170 | 170.00 |
| 12 | D | 2024-01-01 | 90 | 100.00 |
| 13 | D | 2024-01-02 | 110 | 110.00 |
| 14 | E | 2024-01-01 | 140 | 150.00 |
| 15 | E | 2024-01-02 | 160 | 157.50 |
| 16 | E | 2024-01-03 | 155 | 155.00 |
Example 5: Centered Moving Average (Previous + Current + Next)
SELECT
sale_id,
customer_id,
sale_date,
amount,
AVG(amount) OVER (
PARTITION BY customer_id
ORDER BY sale_date
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
) AS centered_avg_3
FROM tmp_sales;

\[\downarrow\]
| sale_id | customer_id | sale_date | amount | centered_avg_3 |
|---|---|---|---|---|
| 1 | A | 2024-01-01 | 100 | 110.00 |
| 2 | A | 2024-01-02 | 120 | 103.33 |
| 3 | A | 2024-01-03 | 90 | 113.33 |
| 4 | A | 2024-01-04 | 130 | 110.00 |
| 5 | B | 2024-01-01 | 180 | 190.00 |
| 6 | B | 2024-01-02 | 200 | 200.00 |
| 7 | B | 2024-01-03 | 220 | 206.67 |
| 8 | B | 2024-01-04 | 200 | 210.00 |
| 9 | C | 2024-01-01 | 150 | 150.00 |
| 10 | C | 2024-01-02 | 150 | 156.67 |
| 11 | C | 2024-01-03 | 170 | 160.00 |
| 12 | D | 2024-01-01 | 90 | 100.00 |
| 13 | D | 2024-01-02 | 110 | 100.00 |
| 14 | E | 2024-01-01 | 140 | 150.00 |
| 15 | E | 2024-01-02 | 160 | 151.67 |
| 16 | E | 2024-01-03 | 155 | 157.50 |
Key Observations
- Framing determines visibility, not grouping
- Many subtle bugs come from implicit frames
- Explicit frames improve:
- Correctness
- Readability
- Portability
When to Use Explicit Window Frames
- Running totals and cumulative metrics
- Moving averages and rolling windows
- Navigational functions (`FIRST_VALUE`, `LAST_VALUE`)
- Any time correctness depends on row visibility
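The navigational functions are the classic trap: with the default frame, the window ends at the current row, so `LAST_VALUE()` simply returns the current row's value. A sketch against `tmp_sales`:

```sql
SELECT
    sale_id,
    customer_id,
    amount,
    -- default frame ends at CURRENT ROW: this just echoes the current amount
    LAST_VALUE(amount) OVER (
        PARTITION BY customer_id
        ORDER BY sale_date
    ) AS last_with_default_frame,
    -- explicit frame makes the whole partition visible
    LAST_VALUE(amount) OVER (
        PARTITION BY customer_id
        ORDER BY sale_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS last_with_full_frame
FROM tmp_sales;
```

For customer A, `last_with_default_frame` gives 100, 120, 90, 130 (each row's own amount), while `last_with_full_frame` gives 130 on every row, which is what "last value" usually means analytically.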
This layer gives you full control over window behavior, completing the mental model of advanced analytical SQL.
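When several columns share the same frame, many engines (PostgreSQL among them; not every database supports this clause) let you name the window once and reuse it. This is a style sketch, not required for the assignment below:

```sql
SELECT
    sale_id,
    customer_id,
    SUM(amount) OVER w AS running_total,
    AVG(amount) OVER w AS running_avg
FROM tmp_sales
WINDOW w AS (
    PARTITION BY customer_id
    ORDER BY sale_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
);
```

Naming the window keeps the frame definition in one place, so a later fix to the frame cannot silently diverge between columns.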
In-Class Assignment: Window Functions on the Analytics Schema
Context
You are working with the analytics schema of a transactional system that includes customers, orders, products, geography, and time.
Your task is to answer analytical questions using window functions, without collapsing rows unless explicitly stated.
You are expected to rely on:
- `PARTITION BY`
- `ORDER BY`
- Appropriate window functions from the layers discussed in class
Do not use subqueries or CTEs unless explicitly requested.
Task 1: Simple Aggregate Window Functions
For each order, compute the average order value of the customer across all their orders.
- Use `AVG()` as a window function
- Preserve one row per order
Expected concepts:
- `AVG() OVER (PARTITION BY customer_id)`
- Row-level enrichment with group context
Task 2: Statistical Window Functions
For each customer, determine the percent rank of each order based on order revenue.
Steps:
- Compute order revenue as \[\text{order\_revenue} = \sum (\text{quantity} \times \text{price})\]
- Apply `PERCENT_RANK()` within each customer
Expected concepts:
- Distribution within partitions
- Relative positioning of rows
Task 3: Value-from-Another-Row Functions
For each customer, compute the difference between the current order and the previous order in time.
- Orders must be compared chronologically
- The first order per customer should return `NULL` for the difference
Expected concepts:
- `LAG()`
- Time-aware ordering
- Change detection
Task 4: Ranking Functions
Within each city, rank customers by their total lifetime spend.
Requirements:
- Customers with the same spend must receive the same rank
- No gaps should appear in ranking
Expected concepts:
- `DENSE_RANK()`
- Partitioning by geography
- Ranking based on aggregated metrics
Task 5: String Aggregation Window Functions
For each customer, construct a chronological history of product categories they have purchased.
Requirements:
- Categories must be ordered by order date
- The full history must appear on every row for that customer
Expected concepts:
- `STRING_AGG()` with window framing
- Ordered categorical sequencing
- Behavioral traceability
Task 6: Channel Behavior Pattern (Advanced)
Create a result showing:
- A channel usage pattern (e.g. `online, store`)
- The number of customers exhibiting that pattern
Rules:
- Do not sort channels alphabetically
- Let the order emerge naturally from purchase behavior
- Customers using only one channel must appear as well
Expected concepts:
- String aggregation
- Grouping by derived behavioral signatures
- Sensitivity to ordering
Submission Guidelines
- Write one query per task
- Each query must be executable independently
- Focus on correctness, not formatting
- Prefer clarity over cleverness
This assignment evaluates your ability to reason across rows, not just aggregate them.