SQL for Data Analysis
2026-04-01
By the end of this section, you will be able to:
Before diving into more SQL syntax, let’s pause to understand what happens behind the scenes when a SQL query is executed.
Think about an e-commerce company, similar to our products–sales case study.
The same idea applies to SQL queries.
Important
What you write is only part of the story; how the database executes it matters just as much.
SQL is a declarative programming language.
This means:
Important
The database engine (RDBMS) decides the most efficient execution strategy.
This statement tells the database:
product_nameproducts tableIt does not specify:
All of this is handled internally by the database engine.
When a query is submitted:
Writing efficient SQL means trusting the database, while still writing clear and well-structured queries.
As a data analyst:
Understanding this process helps you:
As queries grow more complex and datasets become larger, performance matters.
Poorly optimized queries can:
Faster queries mean:
A query plan shows how the database intends to execute a query.
Most RDBMSs (including PostgreSQL) estimate:
You generate a query plan using EXPLAIN.
PostgreSQL returns a plan, not the data.
Example output:
Sequential scans are common when:
Reducing query cost often starts with requesting less data.
The foundation of almost every analytical query is SELECT.
Basic structure:
Example using products:
Selecting only needed columns:
ORDER BY sorts results in ascending or descending order.
Common use cases:
Default (ascending):
Descending (Z \(\rightarrow\) A):
Numeric sorting:
SELECT and FROMDESC reverses orderLIMIT restricts the number of rows returned.
Useful when:
\[\downarrow\]
Important
LIMIT must be at the endGROUP BY groups rows sharing the same values.
It is commonly used with aggregate functions:
COUNT()SUM()AVG()MIN()MAX()\[\downarrow\]
Returns top 5 products by revenue.
DISTINCT returns unique values.
Syntax:
Multiple columns:
Both can return unique combinations.
DISTINCT:
GROUP BY:
Use:
DISTINCT for simplicityGROUP BY when aggregation is neededHAVING is used to filter aggregated results \[\rightarrow\] filters groups after aggregation.WHERE filters rows before aggregationThis distinction is critical for analytical queries.
WHERE \(\rightarrow\) filters raw rowsGROUP BY \(\rightarrow\) creates groupsHAVING \(\rightarrow\) filters aggregated groupsYou cannot use aggregate functions in WHERE.
SQL is written top-down, but executed differently.
Logical order (simplified):
FROMWHEREGROUP BYHAVINGSELECTORDER BYLIMITThis explains why HAVING exists.
HAVING always works together with GROUP BY.
Suppose you want products that generated more than 10,000 in total revenue.
\[\downarrow\]
Using WHERE (incorrect):
This query fails because SUM() is not available at the WHERE stage.
You can (and often should) use both.
WHERE filters rows earlyHAVING filters aggregated resultsThis is more efficient than using HAVING alone.
Find products with at least 50 transactions.
This filters on multiple aggregated metrics.
Use HAVING when:
SUM, COUNT, AVG, etc.Do not use HAVING when row-level filtering is sufficient.