Data Analytics Bootcamp
  • Syllabus
  • Statistical Thinking
  • SQL
  • Python
  • Tableau
  • Lab
  • Capstone
  1. SQL
  2. SQL
  3. Session 10: DA with SQL | Advanced SQL Functions
  • Syllabus
  • Statistical Thinking
    • Statistics
      • Statistics Session 01: Data Layers and Bias in Data
      • Statistics Session 02: Data Types
      • Statistics Session 03: Probabilistic Distributions
      • Statistics Session 04: Probabilistic Distributions
      • Statistics Session 05: Sampling
      • Statistics Session 06: Inferential Statistics
      • Slides
        • Course Intro
        • Descriptive Stats
        • Data Types
        • Continuous Distributions
        • Discrete Distributions
        • Sampling
        • Hypothesis Testing
  • SQL
    • SQL
      • Session 01: Intro to Relational Databases
      • Session 02: Intro to PostgreSQL
      • Session 03: DA with SQL | Data Types & Constraints
      • Session 04: DA with SQL | Filtering
      • Session 05: DA with SQL | Numeric Functions
      • Session 06: DA with SQL | String Functions
      • Session 07: DA with SQL | Date Functions
      • Session 08: DA with SQL | JOINs
      • Session 09: DA with SQL | Advanced SQL
      • Session 10: DA with SQL | Advanced SQL Functions
      • Session 11: DA with SQL | UDFs, Stored Procedures
      • Session 12: DA with SQL | Advanced Aggregations
      • Session 13: DA with SQL | Final Project
      • Slides
        • Intro to Relational Databases
        • Intro to PostgreSQL
        • Basic Queries: DDL DLM
        • Filtering
        • Numeric Functions
        • String Functions
        • Date Functions
        • Normalization and JOINs
        • Temporary Tables
        • Advanced SQL Functions
        • Reporting and Analysis with SQL
        • Advanced Aggregations
  • Python
    • Python
      • Session 01: Programming for Data Analysts
      • Session 02: Python basic Syntax, Data Structures
      • Session 03: Introduction to Pandas
      • Session 04: Advanced Pandas
      • Session 05: Intro to Data Visualization
      • Session 06: Data Visualization
      • Session 07: Working with Dates
      • Session 08: Data Visualization | Plotly
      • Session 09: Customer Segmentation | RFM
      • Slides
        • Data Analyst
  • Tableau
    • Tableau
      • Tableau Session 01: Introduction to Tableau
      • Tableau Session 02: Intermediate Visual Analytics
      • Tableau Session 03: Advanced Analytics
      • Tableau Session 04: Dashboard Design & Performance
      • Slides
        • Data Analyst
        • Data Analyst
        • Data Analyst
        • Data Analyst

On this page

  • Window Functions
    • Why Window Functions Matter in Analytics
    • Window Function Types Layers
    • Window Syntax (Generic)
  • Create an Extended Dummy Temporary Table
    • Output
  • Window Function Layer 1: Simple Aggregate Window Functions
    • Concept
    • Common Functions in This Layer
    • Generic Pattern
    • Example 1: Average Amount per Customer
    • Example 2: Total Spend per Customer
    • Example 3: Transaction Count per Customer
    • Example 4: Minimum and Maximum Amount per Customer
    • Summary of Simple Aggregate Window Functions
    • When to Use Simple Aggregate Window Functions
  • Window Function Layer 2: Statistical Window Functions
    • Concept
    • Common Functions in This Layer
    • Generic Pattern
    • Example 1: Median Amount per Customer
    • Example 2: Percent Rank of Each Transaction
    • Example 3: Cumulative Distribution
    • Example 4: Quartile Assignment
    • Key Observations
    • When to Use Statistical Window Functions
  • Window Function Layer 3: Value-from-Another-Row Functions
    • Concept
    • Common Functions in This Layer
    • Generic Pattern
    • Example 1: Previous Transaction Amount (LAG)
    • Example 2: Next Transaction Amount (LEAD)
    • Example 3: Change Since Previous Transaction
    • Example 4: First Transaction Amount per Customer
    • Example 5: Last Transaction Amount per Customer
    • When to Use Value-from-Another-Row Functions
  • Window Function Layer 4: Ranking Functions
    • Concept
    • Common Functions in This Layer
    • Generic Pattern
    • Function Semantics
    • Example 1: Sequential Ordering (ROW_NUMBER)
    • Example 2: Dense Ranking by Amount (DENSE_RANK)
    • Example 3: Ranking by Amount with Gaps (RANK)
    • Example 4: Comparison
    • Summary of Ranking Functions
    • Practical Analytics Use Cases
    • When to Use Ranking Functions
  • Window Function Layer 5: String Aggregation Window Functions
    • Concept
    • Common Functions in This Layer
    • Generic Pattern
    • Example 1: Channels Used by Each Customer (Cumulative)
    • Example 2: Full Channel History per Customer
    • Example 3: Channel Combination Pattern 1
    • Example 4: Channel Combination Pattern 2
    • Example 5: Channel Combination Pattern Grouped
    • Summary of String Aggregation Window Functions
    • Practical Analytics Use Cases
    • When to Use String Aggregation Window Functions
  • Window Function Layer 6: Window Framing & Advanced Window Control
    • Concept
    • Frame Dimensions
    • Frame Types
    • Common Frame Boundaries
    • Default Frame (Important)
    • Example 1: Running Total (Explicit Frame)
    • Example 2: Full-Partition Aggregate (Stable Value)
    • Example 3: Moving Window (Last 2 Transactions)
    • Example 4: Forward-Looking Average
    • Example 5: Centered Moving Average (Previous + Current + Next)
    • Key Observations
    • When to Use Explicit Window Frames
  • In-Class Assignment: Window Functions on the Analytics Schema
    • Context
    • Task 1: Simple Aggregate Window Functions
    • Task 2: Statistical Window Functions
    • Task 3: Value-from-Another-Row Functions
    • Task 4: Ranking Functions
    • Task 5: String Aggregation Window Functions
    • Task 6: Channel Behavior Pattern (Advanced)
    • Submission Guidelines
  1. SQL
  2. SQL
  3. Session 10: DA with SQL | Advanced SQL Functions

Session 10: DA with SQL | Advanced SQL Functions

Window Functions
LAG
LEAD
ROW_NUMBER
RANK
DENSE_RANK

Window Functions

A window function is a SQL function that performs a calculation over a set of rows related to the current row, called a window, without collapsing rows.

This is the defining property:

A window function adds information to each row by looking at other rows.

Formally, a window function:

  • Operates over a window of rows defined by OVER(...)
  • Preserves the original number of rows
  • Attaches the computed value back to each row

This distinguishes window functions from:

  • GROUP BY aggregates (which reduce rows)
  • Subqueries and CTEs (which restructure data)

Mental model:

Each row keeps its identity, but gains context.

Why Window Functions Matter in Analytics

Analytical questions often require comparison, not just aggregation.

Examples:

  • How does this order compare to the customer’s average?
  • What changed since the previous transaction?
  • What is the rank of this event within a group?
  • How does this value evolve over time?

Window functions are designed precisely for these tasks.

Window Function Types Layers

  1. Aggregate over a window (e.g., SUM(), AVG(), COUNT())
  2. Value from another row (e.g., LAG(), LEAD())
  3. Ranking functions (e.g., ROW_NUMBER(), RANK(), DENSE_RANK())
  4. Cumulative distributions (e.g., CUME_DIST(), PERCENT_RANK())
  5. String aggregation (e.g., STRING_AGG())
  6. Navigational functions (e.g., FIRST_VALUE(), LAST_VALUE())
  7. Framing & Advanced Window Control (Advanced)

Window Syntax (Generic)

function_name(expression)
OVER (
    PARTITION BY ...
    ORDER BY ...
)
  • PARTITION BY defines who is compared with whom
  • ORDER BY defines sequence inside the window
  • The window determines which rows are visible

Create an Extended Dummy Temporary Table

Before applying window functions to the analytics schema, we use a controlled temporary table with:

  • Multiple entities
  • Time ordering
  • Repeated values (for ranking)
  • A categorical column (for STRING_AGG)
CREATE TEMP TABLE tmp_sales AS
SELECT *
FROM (
    VALUES
        
        (1,  'A', DATE '2024-01-01', 100, 'online'),
        (2,  'A', DATE '2024-01-02', 120, 'store'),
        (3,  'A', DATE '2024-01-03', 90,  'online'),
        (4,  'A', DATE '2024-01-04', 130, 'store'),

        
        (5,  'B', DATE '2024-01-01', 180, 'store'),
        (6,  'B', DATE '2024-01-02', 200, 'online'),
        (7,  'B', DATE '2024-01-03', 220, 'online'),
        (8,  'B', DATE '2024-01-04', 200, 'store'),

        
        (9,  'C', DATE '2024-01-01', 150, 'online'),
        (10, 'C', DATE '2024-01-02', 150, 'online'),
        (11, 'C', DATE '2024-01-03', 170, 'online'),

        
        (12, 'D', DATE '2024-01-01', 90,  'store'),
        (13, 'D', DATE '2024-01-02', 110, 'store'),

        
        (14, 'E', DATE '2024-01-01', 140, 'store'),
        (15, 'E', DATE '2024-01-02', 160, 'online'),
        (16, 'E', DATE '2024-01-03', 155, 'store')
) AS t(
    sale_id,
    customer_id,
    sale_date,
    amount,
    channel
);

Output

sale_id customer_id sale_date amount channel
1 A 2024-01-01 100 online
2 A 2024-01-02 120 store
3 A 2024-01-03 90 online
4 A 2024-01-04 130 store
5 B 2024-01-01 180 store
6 B 2024-01-02 200 online
7 B 2024-01-03 220 online
8 B 2024-01-04 200 store
9 C 2024-01-01 150 online
10 C 2024-01-02 150 online
11 C 2024-01-03 170 online
12 D 2024-01-01 90 store
13 D 2024-01-02 110 store
14 E 2024-01-01 140 store
15 E 2024-01-02 160 online
16 E 2024-01-03 155 store

Window Function Layer 1: Simple Aggregate Window Functions

Simple aggregate window functions apply classical aggregation logic (such as SUM, AVG, COUNT, MIN, MAX) without collapsing rows.

They answer the question:

How does this row relate to an aggregate of its group/partition?


Concept

A simple aggregate window function computes an aggregate value over a partition of rows and attaches that value to every row in the partition.

Unlike GROUP BY:

  • Rows are not reduced
  • Each row keeps its original granularity
  • Aggregates become contextual attributes

Compute a group-level metric, then stamp it onto every row in the group.

Common Functions in This Layer

  • SUM()
  • AVG()
  • COUNT()
  • MIN()
  • MAX()

All of them become window functions when combined with OVER(...).

Generic Pattern

AGGREGATE_FUNCTION(column)
OVER (
    PARTITION BY grouping_column
)

Example 1: Average Amount per Customer

SELECT
    sale_id,
    customer_id,
    sale_date,
    amount,
    AVG(amount) OVER (PARTITION BY customer_id) AS avg_customer_amount
FROM tmp_sales;

\(\downarrow\)

sale_id customer_id sale_date amount avg_customer_amount
1 A 2024-01-01 100 110.000
2 A 2024-01-02 120 110.000
3 A 2024-01-03 90 110.000
4 A 2024-01-04 130 110.000
5 B 2024-01-01 180 200.000
6 B 2024-01-02 200 200.000
7 B 2024-01-03 220 200.000
8 B 2024-01-04 200 200.000
9 C 2024-01-01 150 156.667
10 C 2024-01-02 150 156.667
11 C 2024-01-03 170 156.667
12 D 2024-01-01 90 100.000
13 D 2024-01-02 110 100.000
14 E 2024-01-01 140 151.667
15 E 2024-01-02 160 151.667
16 E 2024-01-03 155 151.667

\(\downarrow\)

Customer A
Amounts: 100 120 90 130
Average: (100 + 120 + 90 + 130) / 4 = 110

Each row sees ALL rows in partition:

[100, 120, 90, 130] → 110
[100, 120, 90, 130] → 110
[100, 120, 90, 130] → 110
[100, 120, 90, 130] → 110

Example 2: Total Spend per Customer

SELECT
    sale_id,
    customer_id,
    amount,
    SUM(amount) OVER (
        PARTITION BY customer_id
    ) AS total_customer_spend
FROM tmp_sales;

\(\downarrow\)

sale_id customer_id amount total_customer_spend
1 A 100 440.000
2 A 120 440.000
3 A 90 440.000
4 A 130 440.000
5 B 180 800.000
6 B 200 800.000
7 B 220 800.000
8 B 200 800.000
9 C 150 470.000
10 C 150 470.000
11 C 170 470.000
12 D 90 200.000
13 D 110 200.000
14 E 140 455.000
15 E 160 455.000
16 E 155 455.000

\(\downarrow\)

Customer A
Amounts: 100 120 90 130
Total: 100 + 120 + 90 + 130 = 440

Each row sees ALL rows in partition:

[100, 120, 90, 130] → 440
[100, 120, 90, 130] → 440
[100, 120, 90, 130] → 440
[100, 120, 90, 130] → 440

Example 3: Transaction Count per Customer

SELECT
    sale_id,
    customer_id,
    COUNT(*) OVER (
        PARTITION BY customer_id
    ) AS transaction_count
FROM tmp_sales;

\(\downarrow\)

sale_id customer_id transaction_count
1 A 4
2 A 4
3 A 4
4 A 4
5 B 4
6 B 4
7 B 4
8 B 4
9 C 3
10 C 3
11 C 3
12 D 2
13 D 2
14 E 3
15 E 3
16 E 3

\(\downarrow\)

Customer A
Transactions: 4 rows

Each row sees ALL rows in partition:

[1, 2, 3, 4] → 4
[1, 2, 3, 4] → 4
[1, 2, 3, 4] → 4
[1, 2, 3, 4] → 4

Example 4: Minimum and Maximum Amount per Customer

SELECT
    sale_id,
    customer_id,
    amount,
    MIN(amount) OVER (
        PARTITION BY customer_id
    ) AS min_amount,
    MAX(amount) OVER (
        PARTITION BY customer_id
    ) AS max_amount
FROM tmp_sales;

\(\downarrow\)

sale_id customer_id amount min_amount max_amount
1 A 100 90.000 130.000
2 A 120 90.000 130.000
3 A 90 90.000 130.000
4 A 130 90.000 130.000
5 B 180 180.000 220.000
6 B 200 180.000 220.000
7 B 220 180.000 220.000
8 B 200 180.000 220.000
9 C 150 150.000 170.000
10 C 150 150.000 170.000
11 C 170 150.000 170.000
12 D 90 90.000 110.000
13 D 110 90.000 110.000
14 E 140 140.000 160.000
15 E 160 140.000 160.000
16 E 155 140.000 160.000

\(\downarrow\)

Customer A
Amounts: 100 120 90 130
Min: 90
Max: 130

Each row sees ALL rows in partition:

[100, 120, 90, 130] → min=90, max=130
[100, 120, 90, 130] → min=90, max=130
[100, 120, 90, 130] → min=90, max=130
[100, 120, 90, 130] → min=90, max=130

Summary of Simple Aggregate Window Functions

  • Aggregation logic is identical to GROUP BY
  • The difference is row preservation
  • These functions create reference baselines
  • They are often the foundation for:
    • Variance analysis
    • Normalization
    • Z-scores
    • Behavioral comparisons

When to Use Simple Aggregate Window Functions

  • Comparing a row to its group average
  • Attaching group-level metrics to granular data
  • Preparing features for modeling
  • Avoiding joins or subqueries for simple enrichments

This layer forms the foundation for all advanced window logic.

Window Function Layer 2: Statistical Window Functions

Statistical window functions extend simple aggregates by answering distributional questions instead of basic totals or averages.

They answer the question:

Where does this row stand within the distribution of its group?


Concept

A statistical window function computes position, spread, or relative standing of a row within a partition.

Unlike simple aggregates:

  • They focus on ordering and distribution
  • They often require ORDER BY
  • Results depend on relative position, not just totals

These functions are essential for exploratory data analysis, outlier detection, and ranking-based insights.


Common Functions in This Layer

  • PERCENTILE_CONT()
  • PERCENTILE_DISC()
  • MEDIAN()
  • PERCENT_RANK()
  • CUME_DIST()

Generic Pattern

STAT_FUNCTION(...)
OVER (
    PARTITION BY grouping_column
    ORDER BY ordering_column
)

Example 1: Median Amount per Customer

SELECT
    customer_id,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount)  AS median_amount
FROM tmp_sales
GROUP BY customer_id;

\(\downarrow\)

customer_id median_amount
A 110.000
B 200.000
C 150.000
D 100.000
E 155.000

\(\downarrow\)

Customer A
Sorted amounts: 90 100 120 130
Median = (100 + 120) / 2 = 110

Customer C
Sorted amounts: 150 150 170
Median = 150

Example 2: Percent Rank of Each Transaction

SELECT
    sale_id,
    customer_id,
    amount,
    PERCENT_RANK() OVER (
        PARTITION BY customer_id
        ORDER BY amount
    ) AS percent_rank
FROM tmp_sales;

\(\downarrow\)

sale_id customer_id amount percent_rank
3 A 90 0.000
1 A 100 0.333
2 A 120 0.667
4 A 130 1.000
5 B 180 0.000
6 B 200 0.333
8 B 200 0.333
7 B 220 1.000
9 C 150 0.000
10 C 150 0.000
11 C 170 1.000

\(\downarrow\)

Customer A
Sorted: 90 100 120 130

Position formula:
(rank - 1) / (n - 1)

90  → (1-1)/3 = 0
100 → (2-1)/3 = 0.33
120 → (3-1)/3 = 0.67
130 → (4-1)/3 = 1

Example 3: Cumulative Distribution

SELECT
    sale_id,
    customer_id,
    amount,
    CUME_DIST() OVER (
        PARTITION BY customer_id
        ORDER BY amount
    ) AS cumulative_distribution
FROM tmp_sales;

\(\downarrow\)

sale_id customer_id amount cumulative_distribution
3 A 90 0.250
1 A 100 0.500
2 A 120 0.750
4 A 130 1.000
5 B 180 0.250
6 B 200 0.750
8 B 200 0.750
7 B 220 1.000

\(\downarrow\)

Customer A
Sorted: 90 100 120 130

CUME_DIST = (# values ≤ current) / total

90  → 1/4 = 0.25
100 → 2/4 = 0.50
120 → 3/4 = 0.75
130 → 4/4 = 1

Example 4: Quartile Assignment

SELECT
    sale_id,
    customer_id,
    amount,
    NTILE(4) OVER (
        PARTITION BY customer_id
        ORDER BY amount
    ) AS quartile
FROM tmp_sales;

\(\downarrow\)

sale_id customer_id amount quartile
3 A 90 1
1 A 100 2
2 A 120 3
4 A 130 4
5 B 180 1
6 B 200 2
8 B 200 3
7 B 220 4

\(\downarrow\)

Customer A
Sorted: 90 100 120 130

Split into 4 equal buckets:

90  → Q1
100 → Q2
120 → Q3
130 → Q4

Key Observations

  • Statistical window functions are order-sensitive
  • They expose distribution shape
  • Results are relative, not absolute
  • They are critical for:
    • Outlier detection
    • Segmentation
    • Threshold-based decisions

When to Use Statistical Window Functions

  • Understanding customer behavior distributions
  • Identifying top/bottom performers
  • Creating quantile-based segments
  • Normalizing values for comparison

This layer bridges descriptive statistics and analytical SQL, preparing the ground for ranking and time-aware functions.

Window Function Layer 3: Value-from-Another-Row Functions

Value-from-another-row window functions allow a row to access values from neighboring rows within the same partition.

They answer the question:

How does this row relate to previous or next rows in sequence?


Concept

A value-from-another-row window function retrieves a value from a different row in the same window, based on ordering.

These functions are inherently time-aware and sequence-aware.

Unlike aggregate or statistical functions:

  • They do not summarize a group
  • They compare a row to specific neighboring rows
  • Ordering is mandatory

Common Functions in This Layer

  • LAG()
  • LEAD()
  • FIRST_VALUE()
  • LAST_VALUE()

Generic Pattern

VALUE_FUNCTION(column [, offset])
OVER (
    PARTITION BY grouping_column
    ORDER BY ordering_column
)
  • offset (optional) defines how many rows forward/backward to look
  • Default offset is 1

Example 1: Previous Transaction Amount (LAG)

Business intuition:
“How did this transaction change compared to the previous one?”

SELECT
    sale_id,
    customer_id,
    sale_date,
    amount,
    LAG(amount) OVER (
        PARTITION BY customer_id
        ORDER BY sale_date
    ) AS previous_amount
FROM tmp_sales;

\(\downarrow\)

sale_id customer_id sale_date amount previous_amount
1 A 2024-01-01 100
2 A 2024-01-02 120 100
3 A 2024-01-03 90 120
4 A 2024-01-04 130 90
5 B 2024-01-01 180
6 B 2024-01-02 200 180
7 B 2024-01-03 220 200
8 B 2024-01-04 200 220
9 C 2024-01-01 150
10 C 2024-01-02 150 150
11 C 2024-01-03 170 150
12 D 2024-01-01 90
13 D 2024-01-02 110 90
14 E 2024-01-01 140
15 E 2024-01-02 160 140
16 E 2024-01-03 155 160

\(\downarrow\)

Customer A
Transactions by date: 100 120 90 130

Current row looks 1 row backward:

100 → previous = NULL
120 → previous = 100
90  → previous = 120
130 → previous = 90

Example 2: Next Transaction Amount (LEAD)

SELECT
    sale_id,
    customer_id,
    sale_date,
    amount,
    LEAD(amount) OVER (
        PARTITION BY customer_id
        ORDER BY sale_date
    ) AS next_amount
FROM tmp_sales;

\(\downarrow\)

sale_id customer_id sale_date amount next_amount
1 A 2024-01-01 100 120
2 A 2024-01-02 120 90
3 A 2024-01-03 90 130
4 A 2024-01-04 130
5 B 2024-01-01 180 200
6 B 2024-01-02 200 220
7 B 2024-01-03 220 200
8 B 2024-01-04 200
9 C 2024-01-01 150 150
10 C 2024-01-02 150 170
11 C 2024-01-03 170
12 D 2024-01-01 90 110
13 D 2024-01-02 110
14 E 2024-01-01 140 160
15 E 2024-01-02 160 155
16 E 2024-01-03 155

\(\downarrow\)

Customer A
Transactions by date: 100 120 90 130

Current row looks 1 row forward:

100 → next = 120
120 → next = 90
90  → next = 130
130 → next = NULL

Example 3: Change Since Previous Transaction

SELECT
    sale_id,
    customer_id,
    sale_date,
    amount,
    amount
      - LAG(amount) OVER (
            PARTITION BY customer_id
            ORDER BY sale_date
        ) AS amount_change
FROM tmp_sales;

\(\downarrow\)

sale_id customer_id sale_date amount amount_change
1 A 2024-01-01 100
2 A 2024-01-02 120 20
3 A 2024-01-03 90 -30
4 A 2024-01-04 130 40
5 B 2024-01-01 180
6 B 2024-01-02 200 20
7 B 2024-01-03 220 20
8 B 2024-01-04 200 -20
9 C 2024-01-01 150
10 C 2024-01-02 150 0
11 C 2024-01-03 170 20
12 D 2024-01-01 90
13 D 2024-01-02 110 20
14 E 2024-01-01 140
15 E 2024-01-02 160 20
16 E 2024-01-03 155 -5

\(\downarrow\)

Customer A
Transactions by date: 100 120 90 130

Formula:
current amount - previous amount

100 → NULL
120 - 100 = 20
90  - 120 = -30
130 - 90  = 40

Example 4: First Transaction Amount per Customer

SELECT
    sale_id,
    customer_id,
    sale_date,
    amount,
    FIRST_VALUE(amount) OVER (
        PARTITION BY customer_id
        ORDER BY sale_date
    ) AS first_amount
FROM tmp_sales;

\(\downarrow\)

sale_id customer_id sale_date amount first_amount
1 A 2024-01-01 100 100
2 A 2024-01-02 120 100
3 A 2024-01-03 90 100
4 A 2024-01-04 130 100
5 B 2024-01-01 180 180
6 B 2024-01-02 200 180
7 B 2024-01-03 220 180
8 B 2024-01-04 200 180
9 C 2024-01-01 150 150
10 C 2024-01-02 150 150
11 C 2024-01-03 170 150
12 D 2024-01-01 90 90
13 D 2024-01-02 110 90
14 E 2024-01-01 140 140
15 E 2024-01-02 160 140
16 E 2024-01-03 155 140

\(\downarrow\)

Customer A
Transactions by date: 100 120 90 130

First row in partition = 100

100 → 100
120 → 100
90  → 100
130 → 100

Example 5: Last Transaction Amount per Customer

SELECT
    sale_id,
    customer_id,
    sale_date,
    amount,
    LAST_VALUE(amount) OVER (
        PARTITION BY customer_id
        ORDER BY sale_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS last_amount
FROM tmp_sales;

\(\downarrow\)

sale_id customer_id sale_date amount last_amount
1 A 2024-01-01 100 130
2 A 2024-01-02 120 130
3 A 2024-01-03 90 130
4 A 2024-01-04 130 130
5 B 2024-01-01 180 200
6 B 2024-01-02 200 200
7 B 2024-01-03 220 200
8 B 2024-01-04 200 200
9 C 2024-01-01 150 170
10 C 2024-01-02 150 170
11 C 2024-01-03 170 170
12 D 2024-01-01 90 110
13 D 2024-01-02 110 110
14 E 2024-01-01 140 155
15 E 2024-01-02 160 155
16 E 2024-01-03 155 155

\(\downarrow\)

Customer A
Transactions by date: 100 120 90 130

Last row in partition = 130

100 → 130
120 → 130
90  → 130
130 → 130
Important
  • Ordering is mandatory
  • These functions are directional
  • They enable:
    • Trend analysis
    • Change detection
    • Time-series feature engineering

When to Use Value-from-Another-Row Functions

  • Measuring growth or decline
  • Detecting behavioral changes
  • Comparing events over time
  • Preparing lagged features for models

This layer introduces temporal reasoning, which is fundamental for advanced analytics and forecasting.

Window Function Layer 4: Ranking Functions

Ranking window functions assign ordinal positions to rows within a partition based on a specified ordering.

They answer the question:

How does this row rank relative to other rows in its group?


Concept

A ranking window function assigns a numeric rank to each row according to an ORDER BY rule inside a partition.

Unlike statistical functions:

  • They produce discrete ranks, not continuous measures
  • They are sensitive to ties
  • They are commonly used for top-N analysis and leaderboards

Common Functions in This Layer

  • ROW_NUMBER()
  • RANK()
  • DENSE_RANK()

Each function handles ties differently.


Generic Pattern

RANKING_FUNCTION()
OVER (
    PARTITION BY grouping_column
    ORDER BY ordering_column [ASC | DESC]
)

Function Semantics

Function Handles Ties Gaps in Ranking Typical Use
ROW_NUMBER() No Yes Unique ordering
RANK() Yes Yes Competition ranking
DENSE_RANK() Yes No Tier-based ranking

Example 1: Sequential Ordering (ROW_NUMBER)

Business intuition:
“What is the chronological position of each transaction?”

SELECT
    sale_id,
    customer_id,
    sale_date,
    ROW_NUMBER() OVER (
        PARTITION BY customer_id
        ORDER BY sale_date
    ) AS row_number
FROM tmp_sales;

\(\downarrow\)

sale_id customer_id sale_date row_number
1 A 2024-01-01 1
2 A 2024-01-02 2
3 A 2024-01-03 3
4 A 2024-01-04 4
5 B 2024-01-01 1
6 B 2024-01-02 2
7 B 2024-01-03 3
8 B 2024-01-04 4
9 C 2024-01-01 1
10 C 2024-01-02 2
11 C 2024-01-03 3
12 D 2024-01-01 1
13 D 2024-01-02 2
14 E 2024-01-01 1
15 E 2024-01-02 2
16 E 2024-01-03 3

Example 2: Dense Ranking by Amount (DENSE_RANK)

SELECT
    sale_id,
    customer_id,
    amount,
    DENSE_RANK() OVER (
        PARTITION BY customer_id
        ORDER BY amount DESC
    ) AS dense_rank_amount
FROM tmp_sales;

\(\downarrow\)

sale_id customer_id amount dense_rank_amount
4 A 130 1
2 A 120 2
1 A 100 3
3 A 90 4
7 B 220 1
6 B 200 2
8 B 200 2
5 B 180 3
11 C 170 1
9 C 150 2
10 C 150 2
13 D 110 1
12 D 90 2
15 E 160 1
16 E 155 2
14 E 140 3

Example 3: Ranking by Amount with Gaps (RANK)

Business intuition: “What is the spending rank of each transaction?”

SELECT
    sale_id,
    customer_id,
    amount,
    RANK() OVER (
        PARTITION BY customer_id
        ORDER BY amount DESC
    ) AS rank_amount
FROM tmp_sales;

\(\downarrow\)

sale_id customer_id amount rank_amount
4 A 130 1
2 A 120 2
1 A 100 3
3 A 90 4
7 B 220 1
6 B 200 2
8 B 200 2
5 B 180 4
11 C 170 1
9 C 150 2
10 C 150 2
13 D 110 1
12 D 90 2
15 E 160 1
16 E 155 2
14 E 140 3

Example 4: Comparison

SELECT
    sale_id,
    customer_id,
    amount,
    ROW_NUMBER() OVER (
        PARTITION BY customer_id
        ORDER BY amount DESC
    ) AS row_number_amount,
    RANK() OVER (
        PARTITION BY customer_id
        ORDER BY amount DESC
    ) AS rank_amount,
    DENSE_RANK() OVER (
        PARTITION BY customer_id
        ORDER BY amount DESC
    ) AS dense_rank_amount
FROM tmp_sales
ORDER BY customer_id, amount DESC, sale_id;

\(\downarrow\)

sale_id customer_id amount row_number_amount rank_amount dense_rank_amount
4 A 130 1 1 1
2 A 120 2 2 2
1 A 100 3 3 3
3 A 90 4 4 4
7 B 220 1 1 1
6 B 200 2 2 2
8 B 200 3 2 2
5 B 180 4 4 3
11 C 170 1 1 1
9 C 150 2 2 2
10 C 150 3 2 2
13 D 110 1 1 1
12 D 90 2 2 2
15 E 160 1 1 1
16 E 155 2 2 2
14 E 140 3 3 3

Summary of Ranking Functions

Important
  • Ranking always requires ORDER BY
  • ROW_NUMBER() enforces uniqueness
  • RANK() leaves gaps after ties
  • DENSE_RANK() preserves continuity

Practical Analytics Use Cases

  • Top-N customers or transactions
  • Identifying best/worst performers
  • De-duplication with deterministic ordering
  • Feature engineering for modeling

When to Use Ranking Functions

  • Creating leaderboards
  • Selecting first / top records per group
  • Resolving duplicates deterministically
  • Segmenting by relative performance

Ranking functions introduce comparative structure, preparing the ground for cumulative and navigational window logic.

Window Function Layer 5: String Aggregation Window Functions

String aggregation window functions allow rows to accumulate textual values from related rows without collapsing the dataset.

They answer the question:

What sequence or collection of categorical values is associated with this row’s group?


Concept

A string aggregation window function concatenates values from multiple rows within a partition into a single textual representation, while preserving row-level granularity.

Unlike classic GROUP BY STRING_AGG:

  • Rows are not reduced
  • The aggregated string is attached to each row
  • Ordering controls how the string is built

This makes string aggregation window functions especially useful for behavioral tracing, auditability, and sequence analysis.


Common Functions in This Layer

  • STRING_AGG()

When combined with OVER(...), STRING_AGG() becomes a contextual, row-preserving operation.


Generic Pattern

STRING_AGG(column, delimiter)
OVER (
    PARTITION BY grouping_column
    ORDER BY ordering_column
)
  • delimiter defines how values are separated
  • ORDER BY defines the sequence of concatenation
  • The result grows as the window progresses

Example 1: Channels Used by Each Customer (Cumulative)

Business intuition: “What interaction channels has this customer used up to this point?”

. . .

SELECT
    sale_id,
    customer_id,
    sale_date,
    channel,
    STRING_AGG(channel, ', ') OVER (
        PARTITION BY customer_id
        ORDER BY sale_date
    ) AS channels_used_so_far
FROM tmp_sales;

\[\downarrow\]

sale_id customer_id sale_date channel channels_used_so_far
1 A 2024-01-01 online online
2 A 2024-01-02 store online, store
3 A 2024-01-03 online online, store, online
4 A 2024-01-04 store online, store, online, store
5 B 2024-01-01 store store
6 B 2024-01-02 online store, online
7 B 2024-01-03 online store, online, online
8 B 2024-01-04 store store, online, online, store
9 C 2024-01-01 online online
10 C 2024-01-02 online online, online
11 C 2024-01-03 online online, online, online
12 D 2024-01-01 store store
13 D 2024-01-02 store store, store
14 E 2024-01-01 store store
15 E 2024-01-02 online store, online
16 E 2024-01-03 store store, online, store

Example 2: Full Channel History per Customer

To attach the complete history to every row, use an explicit frame.

SELECT
    sale_id,
    customer_id,
    channel,
    STRING_AGG(channel, ', ') OVER (
        PARTITION BY customer_id
        ORDER BY sale_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS full_channel_history
FROM tmp_sales;

ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING:

  • UNBOUNDED PRECEDING → start from the first row in the partition
  • UNBOUNDED FOLLOWING → go until the last row in the partition

\[\downarrow\]

sale_id customer_id channel full_channel_history
1 A online online, store, online, store
2 A store online, store, online, store
3 A online online, store, online, store
4 A store online, store, online, store
5 B store store, online, online, store
6 B online store, online, online, store
7 B online store, online, online, store
8 B store store, online, online, store
9 C online online, online, online
10 C online online, online, online
11 C online online, online, online
12 D store store, store
13 D store store, store
14 E store store, online, store
15 E online store, online, store
16 E store store, online, store

Example 3: Channel Combination Pattern 1

SELECT
    customer_id,
    STRING_AGG(channel, ', ' ORDER BY channel) AS channel_pattern
    
FROM tmp_sales
GROUP BY customer_id

\[\downarrow\]

customer_id channel_pattern
A online, online, store, store
B online, online, store, store
C online, online, online
D store, store
E online, store, store
Important

try without ORDER BY in STRING_AGG to see different results

Example 4: Channel Combination Pattern 2

SELECT
    customer_id,
    STRING_AGG(channel, ', ' ORDER BY channel) AS channel_pattern,
    STRING_AGG(DISTINCT channel, ', ' ORDER BY channel) AS channel_pattern_distinct
FROM tmp_sales
GROUP BY customer_id

\[\downarrow\]

customer_id channel_pattern channel_pattern_distinct
A online, online, store, store online, store
B online, online, store, store online, store
C online, online, online online
D store, store store
E online, store, store online, store

Example 5: Channel Combination Pattern Grouped

WITH customer_channels AS (
    SELECT
        customer_id,
        STRING_AGG(DISTINCT channel, ', ' ORDER BY channel) AS channel_pattern
    FROM tmp_sales
    GROUP BY customer_id
)
SELECT
    channel_pattern,
    COUNT(*) AS customer_count
FROM customer_channels
GROUP BY channel_pattern;

\[\downarrow\]

channel_pattern customer_count
online 1
store 1
online, store 3

Summary of String Aggregation Window Functions

  • Ordering determines string growth
  • Without a frame, aggregation is cumulative
  • With a full frame, aggregation is static per group
  • Strings become contextual descriptors, not summaries

Practical Analytics Use Cases

  • Customer journey reconstruction
  • Channel or event sequence analysis
  • Audit trails and explainability
  • Feature generation for downstream models

When to Use String Aggregation Window Functions

  • When order matters in categorical data
  • When you need traceability, not just counts
  • When building interpretable analytical features
  • When collapsing rows would lose information

This layer completes the transition from numeric comparison to behavioral sequencing, setting the stage for advanced window framing and analytics patterns.

Window Function Layer 6: Window Framing & Advanced Window Control

Window framing controls which rows inside the partition are visible to the window function for each row.

They answer the question:

Exactly which subset of rows should be used when computing the window value for this row?

Concept

A window frame defines a moving or fixed subset of rows relative to the current row.

While:

  • PARTITION BY defines who belongs together
  • ORDER BY defines sequence

The frame defines how far backward and forward the window can see.

Without an explicit frame, SQL engines apply default framing rules, which can lead to unexpected results, especially for cumulative and navigational functions.


Frame Dimensions

A frame is defined using:

  • Frame type
  • Frame boundaries
ROWS | RANGE BETWEEN <start> AND <end>

Frame Types

ROWS

  • Frame is defined by physical row positions
  • Deterministic and predictable
  • Preferred for analytics

RANGE

  • Frame is defined by logical value ranges
  • Can include multiple rows with the same ordering value
  • More subtle and engine-dependent

Common Frame Boundaries

Boundary Meaning
UNBOUNDED PRECEDING From the first row in the partition
CURRENT ROW The current row
UNBOUNDED FOLLOWING Until the last row in the partition
n PRECEDING n rows before
n FOLLOWING n rows after

Default Frame (Important)

For most databases:

ORDER BY column

implicitly means:

RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

This is not always what you want.


Example 1: Running Total (Explicit Frame)

SELECT
    sale_id,
    customer_id,
    sale_date,
    amount,
    SUM(amount) OVER (
        PARTITION BY customer_id
        ORDER BY sale_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total
FROM tmp_sales;

\[\downarrow\]

sale_id customer_id sale_date amount running_total
1 A 2024-01-01 100 100
2 A 2024-01-02 120 220
3 A 2024-01-03 90 310
4 A 2024-01-04 130 440
5 B 2024-01-01 180 180
6 B 2024-01-02 200 380
7 B 2024-01-03 220 600
8 B 2024-01-04 200 800
9 C 2024-01-01 150 150
10 C 2024-01-02 150 300
11 C 2024-01-03 170 470
12 D 2024-01-01 90 90
13 D 2024-01-02 110 200
14 E 2024-01-01 140 140
15 E 2024-01-02 160 300
16 E 2024-01-03 155 455

Example 2: Full-Partition Aggregate (Stable Value)

SELECT
    sale_id,
    customer_id,
    amount,
    AVG(amount) OVER (
        PARTITION BY customer_id
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS full_partition_avg
FROM tmp_sales;

\[\downarrow\]

sale_id customer_id amount full_partition_avg
1 A 100 110.000
2 A 120 110.000
3 A 90 110.000
4 A 130 110.000
5 B 180 200.000
6 B 200 200.000
7 B 220 200.000
8 B 200 200.000
9 C 150 156.667
10 C 150 156.667
11 C 170 156.667
12 D 90 100.000
13 D 110 100.000
14 E 140 151.667
15 E 160 151.667
16 E 155 151.667

Example 3: Moving Window (Last 2 Transactions)

SELECT
    sale_id,
    customer_id,
    sale_date,
    amount,
    AVG(amount) OVER (
        PARTITION BY customer_id
        ORDER BY sale_date
        ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
    ) AS moving_avg_2
FROM tmp_sales;
Note

Here we can us any aggregate function: SUM(), MAX()

\[\downarrow\]

sale_id customer_id sale_date amount moving_avg_2
1 A 2024-01-01 100 100.000
2 A 2024-01-02 120 110.000
3 A 2024-01-03 90 105.000
4 A 2024-01-04 130 110.000
5 B 2024-01-01 180 180.000
6 B 2024-01-02 200 190.000
7 B 2024-01-03 220 210.000
8 B 2024-01-04 200 210.000
9 C 2024-01-01 150 150.000
10 C 2024-01-02 150 150.000
11 C 2024-01-03 170 160.000
12 D 2024-01-01 90 90.000
13 D 2024-01-02 110 100.000
14 E 2024-01-01 140 140.000
15 E 2024-01-02 160 150.000
16 E 2024-01-03 155 157.500

Example 4: Forward-Looking Average

SELECT
    sale_id,
    customer_id,
    sale_date,
    amount,
    AVG(amount) OVER (
        PARTITION BY customer_id
        ORDER BY sale_date
        ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING
    ) AS forward_avg_2
FROM tmp_sales;

\[\downarrow\]

sale_id customer_id sale_date amount forward_avg_2
1 A 2024-01-01 100 110.0000000000000000
2 A 2024-01-02 120 105.0000000000000000
3 A 2024-01-03 90 110.0000000000000000
4 A 2024-01-04 130 130.0000000000000000
5 B 2024-01-01 180 190.0000000000000000
6 B 2024-01-02 200 210.0000000000000000
7 B 2024-01-03 220 210.0000000000000000
8 B 2024-01-04 200 200.0000000000000000
9 C 2024-01-01 150 150.0000000000000000
10 C 2024-01-02 150 160.0000000000000000
11 C 2024-01-03 170 170.0000000000000000
12 D 2024-01-01 90 100.0000000000000000
13 D 2024-01-02 110 110.0000000000000000
14 E 2024-01-01 140 150.0000000000000000
15 E 2024-01-02 160 157.5000000000000000
16 E 2024-01-03 155 155.0000000000000000

Example 5: Centered Moving Average (Previous + Current + Next)

SELECT
    sale_id,
    customer_id,
    sale_date,
    amount,
    AVG(amount) OVER (
        PARTITION BY customer_id
        ORDER BY sale_date
        ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
    ) AS centered_avg_3
FROM tmp_sales;

\[\downarrow\]

sale_id customer_id sale_date amount centered_avg_3
1 A 2024-01-01 100 110.00
2 A 2024-01-02 120 103.33
3 A 2024-01-03 90 113.33
4 A 2024-01-04 130 110.00
5 B 2024-01-01 180 190.00
6 B 2024-01-02 200 200.00
7 B 2024-01-03 220 206.67
8 B 2024-01-04 200 210.00
9 C 2024-01-01 150 150.00
10 C 2024-01-02 150 156.67
11 C 2024-01-03 170 160.00
12 D 2024-01-01 90 100.00
13 D 2024-01-02 110 100.00
14 E 2024-01-01 140 150.00
15 E 2024-01-02 160 151.67
16 E 2024-01-03 155 157.50

Key Observations

  • Framing determines visibility, not grouping
  • Many subtle bugs come from implicit frames
  • Explicit frames improve:
    • Correctness
    • Readability
    • Portability

When to Use Explicit Window Frames

  • Running totals and cumulative metrics
  • Moving averages and rolling windows
  • Navigational functions (FIRST_VALUE, LAST_VALUE)
  • Any time correctness depends on row visibility

This layer gives you full control over window behavior, completing the mental model of advanced analytical SQL.

In-Class Assignment: Window Functions on the Analytics Schema

Context

You are working with the analytics schema of a transactional system that includes customers, orders, products, geography, and time.

Your task is to answer analytical questions using window functions, without collapsing rows unless explicitly stated.

You are expected to rely on:

  • PARTITION BY
  • ORDER BY
  • Appropriate window functions from the layers discussed in class

Do not use subqueries or CTEs unless explicitly requested.


Task 1: Simple Aggregate Window Functions

For each order, compute the average order value of the customer across all their orders.

  • Use AVG() as a window function
  • Preserve one row per order

Expected concepts:

  • AVG() OVER (PARTITION BY customer_id)
  • Row-level enrichment with group context

Task 2: Statistical Window Functions

For each customer, determine the percent rank of each order based on order revenue.

Steps:

  1. Compute order revenue as
    \[\text{order\_revenue} = \sum (\text{quantity} \times \text{price})\]
  2. Apply PERCENT_RANK() within each customer

Expected concepts:

  • Distribution within partitions
  • Relative positioning of rows

Task 3: Value-from-Another-Row Functions

For each customer, compute the difference between the current order and the previous order in time.

  • Orders must be compared chronologically
  • The first order per customer should return NULL for the difference

Expected concepts:

  • LAG()
  • Time-aware ordering
  • Change detection

Task 4: Ranking Functions

Within each city, rank customers by their total lifetime spend.

Requirements:

  • Customers with the same spend must receive the same rank
  • No gaps should appear in ranking

Expected concepts:

  • DENSE_RANK()
  • Partitioning by geography
  • Ranking based on aggregated metrics

Task 5: String Aggregation Window Functions

For each customer, construct a chronological history of product categories they have purchased.

Requirements:

  • Categories must be ordered by order date
  • The full history must appear on every row for that customer

Expected concepts:

  • STRING_AGG() with window framing
  • Ordered categorical sequencing
  • Behavioral traceability

Task 6: Channel Behavior Pattern (Advanced)

Create a result showing:

  • A channel usage pattern (e.g. online, store)
  • The number of customers exhibiting that pattern

Rules:

  • Do not sort channels alphabetically
  • Let the order emerge naturally from purchase behavior
  • Customers using only one channel must appear as well

Expected concepts:

  • String aggregation
  • Grouping by derived behavioral signatures
  • Sensitivity to ordering

Submission Guidelines

  • Write one query per task
  • Each query must be executable independently
  • Focus on correctness, not formatting
  • Prefer clarity over cleverness

This assignment evaluates your ability to reason across rows, not just aggregate them.