

Session 05: DA with SQL | Numeric Functions


From Session 4

Before jumping into the built-in SQL functions, let's make sure the sales_analysis table exists.

Running Docker

Let's run Docker in detached mode:

docker compose up -d

Create sales_analysis table

DROP TABLE IF EXISTS sales_analysis;

CREATE TABLE sales_analysis AS
SELECT
    s.transaction_id,

    o.order_date,
    DATE(o.order_date) AS order_date_date,
    o.year,
    o.quarter,
    o.month,

    c.customer_name,
    c.city,
    c.zip_code,

    p.product_name,
    p.category,
    p.price,

    e.first_name AS employee_first_name,
    e.last_name  AS employee_last_name,
    e.salary     AS employee_salary,

    s.quantity,
    s.discount,
    s.total_sales
FROM sales AS s
JOIN orders AS o
    ON s.order_id = o.order_id
JOIN customers AS c
    ON s.customer_id = c.customer_id
JOIN products AS p
    ON s.product_id = p.product_id
LEFT JOIN employees AS e
    ON s.employee_id = e.employee_id;

Indexes for Filtering Performance

CREATE INDEX idx_sales_analysis_order_date
    ON sales_analysis(order_date_date);

CREATE INDEX idx_sales_analysis_year
    ON sales_analysis(year);

CREATE INDEX idx_sales_analysis_city
    ON sales_analysis(city);

CREATE INDEX idx_sales_analysis_category
    ON sales_analysis(category);
Note

We will take a closer look at JOINs during the next session.

Numeric Functions

This section provides clean, end-to-end coverage of numeric functions, organized from basic to advanced, with a strong analytical mindset.

For each function, you will see:

  • a conceptual input table
  • the SQL query
  • the output table
  • common analytical use cases
  • analytical interpretation

All examples assume the sales_analysis table.

SUM()

SUM() is one of the most frequently used SQL functions.

SUM() Basic Aggregation

SUM() adds numeric values across multiple rows.

Input

transaction_id total_sales
1001 120.50
1002 80.00
1003 99.50

\[\downarrow\]

SELECT
  SUM(total_sales) AS total_revenue
FROM sales_analysis;

\[\downarrow\]

total_revenue
300.00

SUM() and NULL Values | Default Behavior

By default, SUM() ignores NULL values.

Input

transaction_id total_sales
1001 120.50
1002 NULL
1003 99.50

\[\downarrow\]

SELECT
  SUM(total_sales) AS total_revenue
FROM sales_analysis;

\[\downarrow\]

total_revenue
220.00

Key points

  • NULL values do not contribute to the sum
  • NULL does not mean zero
  • if all values are NULL, the result of SUM() is NULL

Handling missing values explicitly will be covered later in a dedicated chapter.
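
These NULL rules are standard SQL, so you can sanity-check them outside PostgreSQL. The sketch below uses Python's built-in sqlite3 module as a lightweight stand-in, with the same illustrative values as the table above:

```python
import sqlite3

# In-memory table mirroring the example above (values are illustrative)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_analysis (transaction_id INTEGER, total_sales REAL)")
conn.executemany(
    "INSERT INTO sales_analysis VALUES (?, ?)",
    [(1001, 120.50), (1002, None), (1003, 99.50)],
)

# NULLs are ignored: 120.50 + 99.50 = 220.00
total, = conn.execute("SELECT SUM(total_sales) FROM sales_analysis").fetchone()
print(total)  # 220.0

# If every value in scope is NULL, SUM() returns NULL (None in Python)
all_null, = conn.execute(
    "SELECT SUM(total_sales) FROM sales_analysis WHERE total_sales IS NULL"
).fetchone()
print(all_null)  # None
```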


SUM() with GROUP BY | Aggregation by Category

Grouping allows SUM() to operate per dimension.

Input

category total_sales
Accessories 50.00
Accessories 70.00
Electronics 120.00

\[\downarrow\]

SELECT
  category,
  SUM(total_sales) AS total_revenue
FROM sales_analysis
GROUP BY category;

\[\downarrow\]

category total_revenue
Accessories 120.00
Electronics 120.00

Analytical Use Cases

  • total revenue calculation
  • revenue by product category
  • total cost or volume
  • cost by department
  • high-level KPI reporting
  • KPI breakdowns for dashboards
Analytical Warning | Always Check the Grain

Before using SUM(), always verify the grain of the table.

  • sales_analysis is at the transaction level
  • summing values duplicated per transaction (e.g. salaries) produces incorrect results
  • aggregation must match the entity you want to measure

Correct use of SUM() depends as much on data structure as on SQL syntax.

AVG()

AVG() is a numeric aggregate function used to calculate the arithmetic mean of a set of values.

In analytics, averages are commonly used to understand typical behavior, but they must be interpreted carefully—especially in the presence of NULL values, outliers, and skewed distributions.


AVG() Basic Aggregation

AVG() calculates the sum of values divided by the number of non-NULL observations.

Input

transaction_id discount
1001 0.10
1002 0.00
1003 0.20

\[\downarrow\]

SELECT
  AVG(discount) AS avg_discount
FROM sales_analysis;

\[\downarrow\]

avg_discount
0.10

The average discount is calculated as:

\[ \frac{0.10 + 0.00 + 0.20}{3} = 0.10 \]


AVG() and NULL Values | Default Behavior

By default, AVG() ignores NULL values in both the numerator and the denominator.

Input

transaction_id discount
1001 0.10
1002 NULL
1003 0.20

\[\downarrow\]

SELECT
  AVG(discount) AS avg_discount
FROM sales_analysis;

\[\downarrow\]

avg_discount
0.15

Explanation

  • only non-NULL values are used
  • denominator is 2, not 3

\[ \frac{0.10 + 0.20}{2} = 0.15 \]

Important

This behavior is often unexpected for beginners and must be understood clearly.
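
Since this trips up many beginners, here is a small sqlite3 sketch (values mirror the table above) showing that the denominator is the non-NULL count, not the row count:

```python
import sqlite3

# In-memory table with one missing discount (values are illustrative)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_analysis (transaction_id INTEGER, discount REAL)")
conn.executemany(
    "INSERT INTO sales_analysis VALUES (?, ?)",
    [(1001, 0.10), (1002, None), (1003, 0.20)],
)

# AVG() divides by the number of non-NULL values (2), not the row count (3)
avg_discount, = conn.execute("SELECT AVG(discount) FROM sales_analysis").fetchone()
print(avg_discount)  # ≈ 0.15

# Equivalent manual calculation: SUM(column) / COUNT(column)
s, c = conn.execute(
    "SELECT SUM(discount), COUNT(discount) FROM sales_analysis"
).fetchone()
print(s / c)  # ≈ 0.15, with COUNT(discount) = 2
```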


AVG() with GROUP BY | Average per Category

Grouping allows AVG() to compute averages per dimension.

Input

category price
Accessories 25.00
Accessories 35.00
Electronics 120.00

\[\downarrow\]

SELECT
  category,
  AVG(price) AS avg_price
FROM sales_analysis
GROUP BY category;

\[\downarrow\]

category avg_price
Accessories 30.00
Electronics 120.00

AVG() with GROUP BY | Average Sales per Customer

A common analytical use case is average behavior per entity.

SELECT
  customer_name,
  AVG(total_sales) AS avg_transaction_value
FROM sales_analysis
GROUP BY customer_name;

This answers questions such as:

  • what is the typical transaction value per customer?
  • who are high-value vs low-value customers?

Analytical Interpretation | When AVG() Is Appropriate

Use AVG() when:

  • data is symmetrically distributed
  • extreme values are rare
  • you want a representative “typical” value

Typical examples:

  • average order value
  • average discount rate
  • average price per category

Analytical Limitation | Sensitivity to Outliers

AVG() is highly sensitive to outliers.

Example scenario

total_sales
20
25
30
500

The average is pulled upward by a single extreme value.

In such cases, the average may misrepresent typical behavior.


AVG() vs Median | Conceptual Note

  • AVG() reflects the mean
  • the median reflects the middle value

Use:

  • AVG → for stable, normally distributed data
  • Median → for skewed data or income-like variables

Median calculations will be covered later in the course.

Analytical Warning | Always Check the Distribution

Before relying on AVG(), consider:

  • presence of outliers
  • skewness of the data
  • whether “typical” behavior is meaningful

Blindly using averages is one of the most common analytical mistakes.

Analytical Use Cases

  • average order value
  • average discount per transaction
  • average price by category
  • benchmarking typical performance

Summary

  • AVG() computes the arithmetic mean
  • NULL values are ignored by default
  • grouping enables per-dimension averages
  • averages are sensitive to outliers
  • distribution shape matters for interpretation
Important

Understanding when not to use AVG() is just as important as knowing how to write it.

MIN() and MAX()

MIN() and MAX() are aggregate functions used to identify extreme values in a dataset.

They answer the questions:

  • What is the smallest value?
  • What is the largest value?

These functions are widely used in exploratory analysis, data validation, and boundary checks.


MIN() and MAX() | Basic Aggregation

MIN() returns the lowest value, and MAX() returns the highest value in a column.

Input

transaction_id total_sales
1001 120.50
1002 80.00
1003 99.50

\[\downarrow\]

SELECT
  MIN(total_sales) AS min_sale,
  MAX(total_sales) AS max_sale
FROM sales_analysis;

\[\downarrow\]

min_sale max_sale
80.00 120.50

MIN() and MAX() and NULL Values | Default Behavior

Both MIN() and MAX() ignore NULL values by default.

Input

transaction_id total_sales
1001 120.50
1002 NULL
1003 99.50

\[\downarrow\]

SELECT
  MIN(total_sales) AS min_sale,
  MAX(total_sales) AS max_sale
FROM sales_analysis;

\[\downarrow\]

min_sale max_sale
99.50 120.50

Key points

  • NULL values do not participate in comparisons
  • if all values are NULL, the result is NULL

MIN() and MAX() with GROUP BY | Extremes per Category

Grouping allows you to find minimum and maximum values per dimension.

Input

category price
Accessories 25.00
Accessories 35.00
Electronics 120.00

\[\downarrow\]

SELECT
  category,
  MIN(price) AS min_price,
  MAX(price) AS max_price
FROM sales_analysis
GROUP BY category;

\[\downarrow\]

category min_price max_price
Accessories 25.00 35.00
Electronics 120.00 120.00

MIN() and MAX() on Text Columns

MIN() and MAX() also work on text (string) columns.

For text values:

  • MIN() returns the alphabetically first value
  • MAX() returns the alphabetically last value

Input

product_name
Cable
Mouse
Keyboard

\[\downarrow\]

SELECT
  MIN(product_name) AS first_product,
  MAX(product_name) AS last_product
FROM sales_analysis;

\[\downarrow\]

first_product last_product
Cable Mouse

This behavior is based on lexicographical (alphabetical) ordering.
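
The exact order of text values depends on the database's collation (for example, case and accent handling differ between locales). With plain ASCII values like these, a quick sqlite3 check reproduces the result above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_analysis (product_name TEXT)")
conn.executemany(
    "INSERT INTO sales_analysis VALUES (?)",
    [("Cable",), ("Mouse",), ("Keyboard",)],
)

# MIN()/MAX() compare strings lexicographically: C < K < M
first, last = conn.execute(
    "SELECT MIN(product_name), MAX(product_name) FROM sales_analysis"
).fetchone()
print(first, last)  # Cable Mouse
```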


Analytical Use Cases

  • identifying cheapest and most expensive products
  • finding first and last dates in time-based data
  • validating numeric ranges
  • detecting unexpected extreme values
  • alphabetical boundary checks for text data

Analytical Warning | Extremes Are Not Typical

MIN() and MAX() show boundaries, not typical behavior.

  • they are highly sensitive to outliers
  • a single abnormal value can dominate the result

Always pair extremes with other statistics such as:

  • AVG()
  • COUNT()
  • distribution analysis

to get a complete picture.


MIN() and MAX() are essential tools for understanding the limits of your data, but they should never be used in isolation.

COUNT()

COUNT() is one of the most fundamental SQL aggregate functions.
It is used to measure volume, cardinality, and data completeness.

Unlike SUM() or AVG(), COUNT() does not depend on numeric magnitude.
It answers a simpler—but critically important—question:

How many rows or values exist?

In analytics, incorrect use of COUNT() is a common source of silent errors, especially when NULLs or duplicates are involved.


COUNT() | Basic Row Counting

COUNT(*) counts the number of rows returned by a query.

Input

transaction_id discount
1001 0.10
1002 NULL
1003 0.20

\[\downarrow\]

SELECT
  COUNT(*) AS total_rows
FROM sales_analysis;

\[\downarrow\]

total_rows
3

COUNT(*) includes all rows, regardless of NULL values.


COUNT(column) | Counting Non-NULL Values

COUNT(column) counts only non-NULL values in a specific column.

Input

transaction_id discount
1001 0.10
1002 NULL
1003 0.20

\[\downarrow\]

SELECT
  COUNT(discount) AS non_null_discounts
FROM sales_analysis;

\[\downarrow\]

non_null_discounts
2

Key distinction

  • COUNT(*) → counts rows
  • COUNT(column) → counts existing values

COUNT() with GROUP BY | Volume per Dimension

Grouping allows COUNT() to measure volume by category.

Input

product_name transaction_id
Mouse 1001
Mouse 1002
Keyboard 1003

\[\downarrow\]

SELECT
  product_name,
  COUNT(*) AS transaction_count
FROM sales_analysis
GROUP BY product_name;

\[\downarrow\]

product_name transaction_count
Mouse 2
Keyboard 1

This is one of the most common analytical patterns in SQL.


COUNT(DISTINCT) | Counting Unique Values

COUNT(DISTINCT column) counts the number of unique, non-NULL values.

SELECT
  COUNT(DISTINCT customer_name) AS unique_customers
FROM sales_analysis;

This answers questions like:

  • how many unique customers exist?
  • how many unique products were sold?

COUNT(DISTINCT) with GROUP BY | Unique Entities per Group

SELECT
  city,
  COUNT(DISTINCT customer_name) AS unique_customers
FROM sales_analysis
GROUP BY city;

This pattern is commonly used for:

  • market reach analysis
  • customer distribution by geography
  • penetration metrics

COUNT() for Duplicate Detection

Duplicates are detected using COUNT() together with GROUP BY and HAVING.

Example | Duplicate Transactions

SELECT
  transaction_id,
  COUNT(*) AS duplicate_count
FROM sales_analysis
GROUP BY transaction_id
HAVING COUNT(*) > 1;

Any returned row indicates a duplicate transaction ID.


COUNT() and NULL Values | Key Rules

  • COUNT(*) counts rows, including those with NULLs
  • COUNT(column) ignores NULLs
  • COUNT(DISTINCT column) ignores NULLs and duplicates
  • if all values are NULL, COUNT(column) returns 0

Understanding these rules is essential for correct metrics.
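
All four rules can be verified in one query; the sqlite3 sketch below adds a duplicate discount so the DISTINCT rule is visible too (values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_analysis (transaction_id INTEGER, discount REAL)")
conn.executemany(
    "INSERT INTO sales_analysis VALUES (?, ?)",
    [(1001, 0.10), (1002, None), (1003, 0.20), (1004, 0.10)],
)

row = conn.execute("""
    SELECT
      COUNT(*),                  -- all rows, NULLs included
      COUNT(discount),           -- non-NULL values only
      COUNT(DISTINCT discount)   -- unique non-NULL values
    FROM sales_analysis
""").fetchone()
print(row)  # (4, 3, 2)

# If every value in scope is NULL, COUNT(column) returns 0, not NULL
zero, = conn.execute(
    "SELECT COUNT(discount) FROM sales_analysis WHERE discount IS NULL"
).fetchone()
print(zero)  # 0
```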


Analytical Warning | Define What You Are Counting

Before using COUNT(), always ask:

  • am I counting rows or entities?
  • do I need DISTINCT?
  • is the table grain aligned with my question?

Counting the wrong thing leads to misleading KPIs even when SQL syntax is correct.


Analytical Use Cases

  • number of transactions
  • number of customers
  • data completeness checks
  • duplicate detection
  • denominator for rates and averages
Important

Mastering COUNT() is a prerequisite for trustworthy analytical SQL.

Row-Level Arithmetic

Row-level arithmetic refers to calculations performed on each individual row of a table.
Unlike aggregate functions (SUM, AVG, COUNT), these operations do not reduce rows; instead, they create derived columns.

Row-level arithmetic is foundational for:

  • feature engineering
  • revenue and cost calculations
  • metric normalization
  • preparing data for aggregation

In practice, most analytical pipelines combine row-level calculations first, followed by aggregation.


Basic Calculation

A common example is calculating revenue at the transaction level.

Input

transaction_id quantity price
1001 2 25.00
1002 1 80.00

\[\downarrow\]

SELECT
  transaction_id,
  quantity * price AS calculated_revenue
FROM sales_analysis;

\[\downarrow\]

transaction_id calculated_revenue
1001 50.00
1002 80.00

Each row is processed independently, producing a new derived value.

Combining Multiple Columns

You can combine more than two columns in a single expression.

Input

transaction_id quantity price discount
1001 2 25.00 0.10
1002 1 80.00 0.00

\[\downarrow\]

SELECT
  transaction_id,
  quantity * price * (1 - discount) AS net_revenue
FROM sales_analysis;

\[\downarrow\]

transaction_id net_revenue
1001 45.00
1002 80.00

This pattern is common in pricing, discounting, and billing logic.

Order of Operations

SQL follows standard arithmetic precedence:

  1. parentheses: ()
  2. multiplication and division: * /
  3. addition and subtraction: + -

Use parentheses to make calculations explicit and readable.

SELECT
  transaction_id,
  (quantity * price) - (quantity * price * discount) AS net_revenue
FROM sales_analysis;

This is equivalent to the previous example but often clearer in business contexts.


Numeric Transformations

Row-level arithmetic is often combined with numeric functions such as ROUND, CEILING, or FLOOR.

SELECT
  transaction_id,
  ROUND(quantity * price, 2) AS rounded_revenue
FROM sales_analysis;

This is especially important for financial reporting.


Row-Level Arithmetic vs Aggregation

It is critical to distinguish between:

  • row-level calculations → preserve rows
  • aggregate calculations → reduce rows
SELECT
  SUM(quantity * price) AS total_revenue
FROM sales_analysis;

Here:

  • quantity * price is computed per row
  • SUM() aggregates those results into a single value

This pattern—calculate first, aggregate second—is a core analytical principle.

Common Analytical Use Cases

  • transaction-level revenue
  • net revenue after discounts
  • cost per unit
  • margin calculations
  • normalized metrics

Analytical Warning | Check NULL Propagation

In arithmetic expressions:

  • if any operand is NULL, the result is NULL
  • this can silently remove rows from later aggregations

Understanding NULL behavior is essential and will be revisited in a dedicated NULL-handling chapter.
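
A short sqlite3 sketch makes the propagation visible: the row with a NULL price produces a NULL derived value, which then silently disappears from the aggregate (values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_analysis (transaction_id INTEGER, quantity INTEGER, price REAL)"
)
conn.executemany(
    "INSERT INTO sales_analysis VALUES (?, ?, ?)",
    [(1001, 2, 25.00), (1002, 1, None)],
)

# NULL propagates through arithmetic: 1 * NULL is NULL, not 0
rows = conn.execute(
    "SELECT transaction_id, quantity * price FROM sales_analysis"
).fetchall()
print(rows)  # [(1001, 50.0), (1002, None)]

# The NULL result then silently drops out of SUM()
total, = conn.execute("SELECT SUM(quantity * price) FROM sales_analysis").fetchone()
print(total)  # 50.0
```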


Row-level arithmetic is the bridge between raw data and meaningful analytical metrics.

CEILING() and FLOOR()

CEILING() and FLOOR() are numeric functions used to round values to integer boundaries.

  • CEILING() rounds a value up to the nearest integer
  • FLOOR() rounds a value down to the nearest integer

These functions are commonly used for:

  • threshold logic
  • range construction
  • distribution analysis

Thresholds and Ranges

Input

transaction_id total_sales
1001 120.10
1002 80.90

\[\downarrow\]

SELECT
  transaction_id,
  CEILING(total_sales) AS rounded_up,
  FLOOR(total_sales)   AS rounded_down
FROM sales_analysis;

\[\downarrow\]

transaction_id rounded_up rounded_down
1001 121 120
1002 81 80

Here:

  • CEILING() is useful when values must meet a minimum threshold
  • FLOOR() is useful when values must not exceed a maximum threshold

Analytical Use Cases

  • billing thresholds (minimum chargeable units)
  • rounding prices or quantities for operational rules
  • constructing numeric ranges
  • discretizing continuous values

CEILING() with GROUP BY | Revenue Ranges

CEILING() can be used to group continuous numeric values into fixed-width ranges, which is a common technique in exploratory and descriptive analytics.

In this example, transaction revenues are grouped into 50-unit revenue ranges.


Input (sales_analysis — simplified view)

transaction_id total_sales
1001 12.40
1002 47.80
1003 52.10
1004 79.90
1005 101.25
1006 138.60

Step 1 | Range Assignment Logic

The expression below assigns each row to a range:

\[ \text{revenue\_range} = \lceil \frac{\text{total\_sales}}{50} \rceil \times 50 \]

This means:

  • values from 0 < x ≤ 50 → range 50
  • values from 50 < x ≤ 100 → range 100
  • values from 100 < x ≤ 150 → range 150

Query

SELECT
  CEILING(total_sales / 50.0) * 50 AS revenue_range,
  COUNT(*) AS transactions,
  SUM(total_sales) AS total_revenue
FROM sales_analysis
GROUP BY CEILING(total_sales / 50.0) * 50
ORDER BY revenue_range;

Intermediate Mapping (Conceptual)

transaction_id total_sales revenue_range
1001 12.40 50
1002 47.80 50
1003 52.10 100
1004 79.90 100
1005 101.25 150
1006 138.60 150

Output

revenue_range transactions total_revenue
50 2 60.20
100 2 132.00
150 2 239.85

How to Interpret the Output

Each row represents a revenue range, not individual transactions.

For example:

  • revenue_range = 100 represents all transactions where
    \[50 < \text{total\_sales} \le 100\]
  • transactions shows how many transactions fall into that range
  • total_revenue shows the sum of sales within the range
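
The same binning logic can be traced step by step in plain Python, with math.ceil playing the role of CEILING(), using the six example transactions above:

```python
import math

# The six example transaction revenues from the table above
total_sales = [12.40, 47.80, 52.10, 79.90, 101.25, 138.60]

# Same logic as CEILING(total_sales / 50.0) * 50, accumulating a
# (transaction count, total revenue) pair per range
ranges = {}
for value in total_sales:
    bucket = math.ceil(value / 50) * 50
    count, revenue = ranges.get(bucket, (0, 0.0))
    ranges[bucket] = (count + 1, revenue + value)

for bucket in sorted(ranges):
    count, revenue = ranges[bucket]
    print(bucket, count, round(revenue, 2))
```

This reproduces the Output table above: ranges 50, 100, and 150, each with two transactions.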

Analytical Use Cases

  • distribution analysis
  • revenue concentration analysis
  • identifying dominant transaction ranges
  • preparing binned metrics for dashboards
  • exploratory analysis before modeling

Analytical Warning | Ranges Are Design Choices

Ranges created with CEILING() or FLOOR() are analytical assumptions.

  • different range widths lead to different interpretations
  • overly wide ranges hide structure
  • overly narrow ranges introduce noise

Always justify:

  • range size
  • boundary logic
  • business meaning

CEILING() combined with GROUP BY is a powerful technique for turning continuous numeric data into interpretable analytical summaries, but its effectiveness depends on thoughtful range design.

COALESCE()

COALESCE() is a SQL function used for explicit NULL handling.
It replaces NULL values with a specified substitute, allowing you to control how missing data is treated before aggregation or analysis.

Handling NULL values explicitly is a critical analytical step, because different substitution strategies lead to different interpretations and KPIs.

Common strategies include:

  • replacing NULL with 0
  • replacing NULL with the average
  • replacing NULL with the median

Each choice has distinct analytical implications.

Note

By default, most aggregate functions ignore NULL values.

COALESCE() lets you override this behavior intentionally.

COALESCE() with Zero | Treat Missing as Zero

Use this approach only when business logic clearly defines NULL as zero, for example:

  • no discount applied
  • no quantity sold
  • no activity recorded

Input

transaction_id discount
1001 0.10
1002 NULL
1003 0.20

\[\downarrow\]

SELECT
  AVG(COALESCE(discount, 0)) AS avg_discount_with_zeros
FROM sales_analysis;

\[\downarrow\]

avg_discount_with_zeros
0.10

Interpretation

\[ (0.10 + 0 + 0.20) / 3 = 0.10 \]

When to Use

  • NULL truly means zero impact
  • KPIs explicitly require zero inclusion
  • operational metrics (counts, volumes)

COALESCE() with Average | Mean Imputation

In this approach, missing values are replaced with the overall average.

This preserves the mean level of the data but reduces variability, since all missing values become identical.

\[\downarrow\]

SELECT
  AVG(COALESCE(discount, sub.avg_discount)) AS avg_discount_mean_imputed
FROM sales_analysis
CROSS JOIN (
  SELECT AVG(discount) AS avg_discount
  FROM sales_analysis
) AS sub;

Conceptual Effect:

  • NULL values become the average
  • overall mean remains unchanged
  • variance is reduced
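
This effect is easy to demonstrate with Python's statistics module (hypothetical discount values; the missing value is replaced by the mean of the observed ones):

```python
import statistics

# Hypothetical observed discounts; one further value is missing (NULL)
observed = [0.10, 0.20, 0.40]
mean = statistics.mean(observed)

# Mean imputation: the missing value becomes the average
imputed = observed + [mean]

# The overall mean is unchanged...
print(statistics.mean(observed), statistics.mean(imputed))

# ...but the population variance shrinks: the imputed point adds no spread
print(statistics.pvariance(observed), statistics.pvariance(imputed))
```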

When to Use Average Imputation

  • distribution is approximately symmetric
  • no strong outliers exist
  • missing values are random and rare
  • reporting-level analysis

When NOT to Use

  • skewed distributions
  • presence of outliers
  • performance or incentive analysis

COALESCE() with Median | Robust Imputation

Median imputation is more robust than average imputation
because it is not affected by outliers.

PostgreSQL supports median calculation using PERCENTILE_CONT(0.5).

\[\downarrow\]

SELECT
  AVG(COALESCE(discount, sub.median_discount)) AS avg_discount_median_imputed
FROM sales_analysis
CROSS JOIN (
  SELECT
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY discount) AS median_discount
  FROM sales_analysis
) AS sub;
Warning

PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY discount) is an ordered-set aggregate function; this family of functions will be covered in upcoming sessions.

When to Use Median Imputation

  • skewed distributions
  • presence of outliers
  • financial or behavioral metrics
  • fairness-sensitive analysis

When NOT to Use

  • very small datasets
  • when business rules require exact values
  • operational counting logic

Average vs Median | How to Choose

Scenario Prefer
Symmetric distribution Average
Skewed distribution Median
Presence of outliers Median
KPI reporting (business averages) Average
Robust or fairness analysis Median
Modeling or downstream ML Median

Key Analytical Warning

Imputation is not neutral.

  • it changes distributions
  • it affects KPIs
  • it must be documented

Before using COALESCE(), always ask:

  • what does NULL represent?
  • am I preserving reality or convenience?
  • will this decision influence decisions or incentives?

Important
  • COALESCE() gives you control over missing data
  • zero imputation is a business decision, not a default
  • average imputation preserves the mean
  • median imputation preserves robustness
  • choosing the wrong strategy leads to misleading insights

NULL handling is not just a technical detail: it is a core analytical responsibility.

Case Study Assignment | Sales Performance & Data Quality Analysis

Context

You are working as a data analyst for an e-commerce company.
Management relies on the sales_analysis table for reporting, performance tracking, and decision-making.

Recently, stakeholders raised concerns about:

  • inconsistent KPIs across dashboards
  • unclear handling of missing values
  • misleading averages in reports
  • lack of visibility into revenue distribution

Your task is to analyze, validate, and redesign key metrics using SQL numeric functions.

Before starting, make sure the table has been updated with NULL values (see the note under Question 3).


The Task

Using only SQL queries on the sales_analysis table, produce a single analytical result set (or a small set of related result sets) that answers the following business questions.

You may create intermediate queries or CTEs if needed.


Business Questions to Address

1. Revenue Overview

  • What is the total revenue of the company?
  • How is revenue distributed across product categories?
  • Which category contributes the largest share of revenue?

2. Typical Transaction Value

Management wants to understand a “typical” transaction.

  • Calculate the average transaction value
  • Calculate the median transaction value
  • Based on the data, explain which metric is more appropriate and why, taking NULL values into account

(Hint: consider skewness)


3. NULL Impact Assessment

Discounts are inconsistently recorded.

  • How many transactions have NULL discounts?
  • Calculate average discount using:
    • default behavior
    • zero imputation
    • average imputation
    • median imputation
  • Explain how each approach changes interpretation
Note

The discount values were updated to NULL earlier; make sure to do the same in case you haven't done it yet.


4. Revenue Distribution Analysis

To improve pricing strategy, management wants to understand revenue ranges.

  • Group transactions into 50-unit revenue ranges
  • For each range, compute:
    • number of transactions
    • total revenue
  • Identify the dominant revenue range

5. Data Quality Check

Before finalizing KPIs:

  • Check for duplicate transaction IDs
  • Explain the risk of aggregating employee salary directly from this table
  • Identify one additional potential data quality risk in sales_analysis

Deliverables

Submit one SQL-driven analysis that includes:

  • SQL queries (PostgreSQL syntax)
  • clear column naming
  • logically structured output
  • brief written interpretation (1–2 sentences per section)
Important | Sync with GitHub
  • create a numeric_functions.sql file in your queries folder
  • once finished, push your work to GitHub:
    • git add queries/numeric_functions.sql
    • git commit -m "adding numeric functions"
    • git push

Key Expectation

This is not a syntax exercise.

You are expected to demonstrate:

  • analytical judgment
  • metric design thinking
  • awareness of data limitations

Final Note

A correct query can still produce a wrong insight.

Your goal is to produce results that management can trust.

These numeric patterns form a critical foundation for data analysis with SQL.

Practical Rule of Thumb

  • average ≈ median → distribution is likely symmetric
  • average > median → right-skewed
  • average < median → left-skewed
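
A quick numeric check of this rule, reusing the outlier example from the AVG() section:

```python
import statistics

symmetric = [20, 25, 30]            # no outliers
right_skewed = [20, 25, 30, 500]    # one extreme value pulls the mean up

print(statistics.mean(symmetric), statistics.median(symmetric))        # 25 and 25
print(statistics.mean(right_skewed), statistics.median(right_skewed))  # 143.75 and 27.5
```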
Tip

Remember to review this topic in the Intro to Statistics sessions.