Session 03: DA with SQL | Data Types & Constraints
Behind the Scenes
Before diving into more SQL syntax, let’s pause for a moment to consider what happens behind the scenes when a SQL query is executed.
To understand why this matters, think about an e-commerce company that sells products online (similar to our case study). If management wants to reduce operational costs or improve customer experience, they must analyze and optimize many background processes—inventory management, order fulfillment, payment processing, and delivery.
The same idea applies when running SQL queries.
What you write is only part of the story; how the database executes it matters just as much.
SQL as a Declarative Language
The first key concept to understand is that SQL is a declarative programming language.
This means:
- You tell the database what result you want
- You do not tell it how to get that result
When you write a SELECT statement, you describe the desired output, and the database engine (RDBMS) decides the most efficient way to retrieve the data.
Consider the following declarative query:
SELECT product_name
FROM products;
This statement tells the database to:
- return all values from the product_name column
- read the data from the products table
What it does not specify:
- which indexes to use
- how the data should be scanned
- whether to read data sequentially or via an index
All of these decisions are handled internally by the database engine.
Query Optimization Happens Automatically
When a query is submitted:
- The database parses the SQL statement
- It checks syntax and permissions
- The query planner evaluates multiple execution strategies
- The optimizer selects the most efficient execution plan
- The query is executed and results are returned
Because of this process, part of writing efficient SQL queries is trusting the database to do its job—while still writing clear, well-structured queries that allow the optimizer to work effectively.
Why This Matters for Data Analysts
As a data analyst, this means:
- You focus on business logic and correctness
- The database focuses on performance and execution
- Clean SQL enables better optimization
- Proper indexes and constraints support faster queries
Understanding what happens behind the scenes will help you:
- interpret slow queries
- write more efficient SQL
- collaborate more effectively with data engineers and database administrators
This foundation becomes especially important as datasets grow larger and queries become more complex.
Query plans can look confusing at first, and you don’t need to understand every detail. In fact, the only element you need to take note of is the cost. This cost doesn’t refer to the actual cost of running the query; rather, it’s an arbitrary unit used to indicate the estimated run time of the query. A plan reports two values: the cost of returning the first row (for example, 0) and the cost of returning all the rows (for example, 64). Since the cost unit doesn’t refer to a specific second or minute, all this tells us is that a query with a cost of 64 will take longer than a query with a cost of 30.
Query Plan
As you become more familiar with SQL, you will start combining multiple SQL commands into scripts.
When working with large databases—often containing millions or billions of records—it becomes essential to consider the time and cost of running those scripts.
For example, imagine working with a large transactional system that stores sales and product data.
Running a poorly optimized query on a frequently accessed table could significantly slow down reports and increase infrastructure costs. That’s why queries that run often should be optimized before being deployed.
Faster queries mean:
- lower computational cost
- better system performance
- quicker insights for decision-makers
What Is a Query Plan?
A query plan shows how the database intends to execute a SQL query.
Most relational database management systems (RDBMS), including PostgreSQL, can estimate:
- how long a query will take
- how many rows will be processed
- which operations will be performed (scans, joins, filters)
To generate a query plan, you simply add the keyword EXPLAIN before your SQL query.
Suppose you want to retrieve all records from a sales table and understand how expensive this operation might be.
EXPLAIN
SELECT *
FROM sales;
When this query is executed, PostgreSQL does not return the data itself.
Instead, it returns a query plan describing how the data would be retrieved.
A basic query plan might look like this:
"Seq Scan on sales (cost=0.00..92.00 rows=5000 width=34)"
This output tells us:
- Seq Scan: PostgreSQL is performing a sequential scan, reading every row in the table
- sales: the table being scanned
- cost: an internal estimate of startup and total execution cost
- rows: estimated number of rows returned
- width: average size of each row in bytes
A sequential scan is common when:
- the table is relatively small
- no suitable index exists
- most rows are expected to be returned
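If you want to experiment with query plans without a PostgreSQL server, SQLite (available through Python’s built-in sqlite3 module) exposes a similar idea via EXPLAIN QUERY PLAN. The sketch below uses an invented sales table to show a sequential scan turning into an index search once a suitable index exists:

```python
import sqlite3

# Toy setup — the table name and data are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (transaction_id INTEGER, product_id INTEGER, total_sales REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [(i, i % 10, 9.99) for i in range(1000)])

# Without an index, filtering must read every row (a sequential scan).
plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM sales WHERE product_id = 3").fetchall()
print(plan_before[0][3])  # e.g. 'SCAN sales'

# After adding an index, the planner can jump straight to matching rows.
conn.execute("CREATE INDEX idx_sales_product_id ON sales (product_id)")
plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM sales WHERE product_id = 3").fetchall()
print(plan_after[0][3])  # e.g. 'SEARCH sales USING INDEX idx_sales_product_id (product_id=?)'
```

The exact wording of the plan varies by SQLite version, which is why the comments say “e.g.”; the key point is the switch from a full scan to an index search.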
Try it yourself
Understanding query plans helps you:
- diagnose slow queries
- decide when indexes are needed
- write more efficient SQL
- communicate performance issues to data engineers or DBAs
As datasets grow, even simple queries can become expensive.
Learning to read and reason about query plans is a key step toward writing production-ready SQL.
Query optimization is a huge topic, most of which goes beyond the scope of the bootcamp and the role of a data analyst. Just know that writing optimized queries is a good habit to get into, especially queries that will be executed frequently.
If you’d like to read more about query optimization:
Refining Your SQL Queries
We have learned that reducing the cost of your queries is important.
But how do you actually optimize SQL queries in practice?
One of the simplest and most effective ways is to be precise about what data you request.
Before introducing additional commands, it’s worth revisiting SELECT, because it is the foundation of almost every analytical query.
SELECT
As you already know, SELECT allows you to retrieve data from a database.
A SELECT statement usually follows this structure:
SELECT column_name_1,
column_name_2
FROM table_name;
If you want to retrieve multiple columns, you list them separated by commas.
Consider the products table:
SELECT product_name,
price,
category
FROM products;
This query tells the database to:
- return the product_name
- return the price
- return the category
- read the data from the products table
Only these three columns will be included in the result set.
Why Column Selection Matters
Selecting only the columns you actually need:
- reduces the amount of data transferred
- improves query performance
- makes results easier to read and interpret
Compare the following two queries:
EXPLAIN
SELECT *
FROM products;
EXPLAIN
SELECT product_name, price
FROM products;
The second query is usually preferable when you only need product names and prices.
ORDER BY
The ORDER BY clause sorts query results in ascending or descending order.
It is commonly used when you want to view:
- the first or last records
- the highest or lowest values
- alphabetically ordered text data
By default, ORDER BY sorts results in ascending order.
SELECT column_name
FROM table_name
ORDER BY column_name;
You can explicitly control the sorting direction using:
- ASC for ascending order
- DESC for descending order
Suppose you want to view products sorted by name in descending alphabetical order (Z \(\rightarrow\) A).
SELECT
product_name,
category
FROM products
ORDER BY product_name DESC;
This query returns the product_name and category columns, sorted from Z to A, because product_name is a text (character) column.
If you sort a numeric column in descending order, results are ordered from largest to smallest.
SELECT
product_name,
price
FROM products
ORDER BY price DESC;
This query returns the most expensive products first.
You can sort by more than one column. The database applies sorting from left to right.
SELECT
product_name,
category,
price
FROM products
ORDER BY category ASC, price DESC;
This means:
- Products are grouped by category (A \(\rightarrow\) Z)
- Within each category, products are sorted by price (high \(\rightarrow\) low)
To sum up:
- ORDER BY is applied after SELECT and FROM
- Default sort order is ascending
- Use DESC for reverse ordering
- Text columns are sorted alphabetically
- Numeric columns are sorted by value
- Multiple columns can be combined for fine-grained control
In the ORDER BY clause, instead of writing column names, we can simply provide their index (their position in the SELECT list).
For example:
SELECT
product_name,
category,
price
FROM products
ORDER BY 2 ASC, 3 DESC;
LIMIT
The LIMIT clause restricts the number of rows returned by a query.
It is useful when:
- you only need a small sample of the data
- you want to preview results
- you want to reduce query cost and execution time
Basic Syntax:
SELECT column_name
FROM table_name
LIMIT number_of_rows;
The value provided to LIMIT must always be a number.
Example:
Suppose you want to retrieve product names and prices, sorted by price from highest to lowest, but you only care about the top 10 most expensive products.
SELECT
product_name,
price
FROM products
ORDER BY price DESC
LIMIT 10;
This query:
- sorts products by price in descending order
- returns only the first 10 rows
Using LIMIT in this way saves time and computational resources, especially when working with large tables.
- LIMIT is always written at the end of a query
- it must be placed after ORDER BY (if present)
- the value must be numeric
If LIMIT is placed elsewhere in the query, PostgreSQL will return an error.
Not all SQL databases use LIMIT.
- PostgreSQL, MySQL, SQLite \(\rightarrow\) LIMIT
- SQL Server \(\rightarrow\) TOP
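To see ORDER BY and LIMIT work end to end, here is a minimal, runnable sketch using SQLite through Python’s sqlite3 module (SQLite shares PostgreSQL’s LIMIT syntax); the product names and prices are invented:

```python
import sqlite3

# Invented products and prices for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_name TEXT, price REAL)")
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [("laptop", 1200.0), ("mouse", 25.0), ("monitor", 300.0),
                  ("keyboard", 45.0), ("webcam", 80.0)])

# Sort by price (highest first) and keep only the top 2 rows.
top_two = conn.execute(
    "SELECT product_name, price FROM products ORDER BY price DESC LIMIT 2").fetchall()
print(top_two)  # [('laptop', 1200.0), ('monitor', 300.0)]
```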
GROUP BY
The GROUP BY clause is very popular among data analysts :)
The GROUP BY clause allows you to group rows that share the same values in one or more columns.
It is commonly used together with aggregate functions to summarize data.
Similar to Excel, SQL provides several aggregate functions that allow you to perform calculations across multiple rows of data.
Below is a list of commonly used aggregate functions.
Note that SUM and AVG can only be applied to numeric columns.
- COUNT() counts the number of rows
- MAX() finds the maximum value in a column
- MIN() finds the minimum value in a column
- SUM() adds up all numeric values in a column
- AVG() calculates the average of numeric values in a column
We are going to learn more functions later.
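As a quick, runnable illustration of these five functions, the sketch below uses SQLite via Python’s sqlite3 module with a made-up products table:

```python
import sqlite3

# Invented sample table for the illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_name TEXT, price REAL)")
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [("a", 10.0), ("b", 20.0), ("c", 30.0)])

# All five aggregates collapse the three rows into a single summary row.
row = conn.execute(
    "SELECT COUNT(*), MAX(price), MIN(price), SUM(price), AVG(price) FROM products"
).fetchone()
print(row)  # (3, 30.0, 10.0, 60.0, 20.0)
```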
Example | Grouping Sales by Product
Suppose you want to calculate total sales per product.
SELECT
product_id,
SUM(total_sales) AS total_revenue
FROM sales
GROUP BY product_id;
This query:
- groups rows by product_id
- calculates total revenue per product
- returns one row per product
Key Rules for GROUP BY
- Every column in SELECT must either:
  - appear in the GROUP BY clause, or
  - be wrapped in an aggregate function
- GROUP BY is evaluated after WHERE and before ORDER BY
Example | Grouping with Sorting and Limiting
SELECT
product_id,
SUM(total_sales) AS total_revenue
FROM sales
GROUP BY product_id
ORDER BY total_revenue DESC
LIMIT 5;
This query returns the top 5 products by total revenue.
- LIMIT controls how many rows are returned
- GROUP BY controls how rows are aggregated
- Both are essential tools for efficient, scalable data analysis
- They are often used together in real-world analytical queries
Example | Discovering Distinct Values
GROUP BY can also be used to identify distinct values in a column.
This is useful when you want to understand how data is organized without seeing every individual row.
To see all product categories in the database, you could start with:
SELECT
category
FROM products;
This query returns the category for every product, which may include many duplicates. If you only want to know which categories exist, GROUP BY is the right tool.
SELECT
category,
COUNT(*)
FROM products
GROUP BY category;
This query:
- groups rows by category
- returns one row per unique category
- removes duplicate category values from the output
The result is a concise summary of the different product categories stored in the database.
Using GROUP BY in this way helps you:
- explore the structure of a dataset
- validate categorical columns
- prepare data for aggregation and reporting
In practice, GROUP BY becomes even more powerful when combined with aggregate functions, which you’ll explore next.
Example | Sales per Product
Aggregate functions are most powerful when combined with GROUP BY, which allows you to summarize data across categories. Suppose you are working with a sales department and want to understand how many sales transactions exist per product.
To begin, recall that the sales table may contain many rows per product.
SELECT
product_id,
COUNT(transaction_id) AS number_of_transactions
FROM sales
GROUP BY product_id;
This query:
- groups rows by product_id
- counts how many transactions exist for each product
- returns one row per product
Example | Total Revenue per Product
You can also summarize numeric columns such as total revenue.
SELECT
product_id,
SUM(total_sales) AS total_revenue
FROM sales
GROUP BY product_id;
Here:
- SUM(total_sales) adds up all sales values per product
- the result shows total revenue generated by each product
Example | Average Price by Category
Aggregate functions are often used for higher-level analysis.
SELECT
category,
AVG(price) AS average_price
FROM products
GROUP BY category;
This query:
- groups products by category
- calculates the average price within each category
GROUP BY with Multiple Columns
You can also group data by multiple columns.
To do this, simply list all grouping columns after GROUP BY, separated by commas.
This allows you to create more granular summaries by combining multiple dimensions.
Example | Grouping by Multiple Attributes
Suppose you want to understand how many sales transactions exist for each combination of product_id and employee_id (later we will join in the employee name).
SELECT
product_id,
employee_id,
COUNT(transaction_id) AS transaction_count
FROM sales
GROUP BY product_id, employee_id
ORDER BY COUNT(transaction_id) DESC;
This query:
- groups records by both product_id and employee_id
- counts the number of transactions in each combination
- returns one row per unique (product_id, employee_id) pair
The output represents:
- one row per product and employee combination
- the total number of transactions for that specific combination
Grouping by multiple columns helps you:
- analyze performance across multiple dimensions
- break down metrics by category, region, channel, or time
- build detailed summaries for dashboards and reports
You will get more practice using multi-column GROUP BY as you move into data summarization, aggregation, and data cleaning tasks later in the course.
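The multi-column grouping above can be sketched in runnable form with SQLite via Python’s sqlite3 module; the transaction rows below are invented:

```python
import sqlite3

# Invented transactions: (transaction_id, product_id, employee_id).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (transaction_id INTEGER, product_id INTEGER, employee_id INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [(1, 1, 10), (2, 1, 10), (3, 1, 20), (4, 2, 10)])

# One output row per unique (product_id, employee_id) pair.
rows = conn.execute(
    "SELECT product_id, employee_id, COUNT(transaction_id) AS transaction_count "
    "FROM sales GROUP BY product_id, employee_id "
    "ORDER BY product_id, employee_id").fetchall()
print(rows)  # [(1, 10, 2), (1, 20, 1), (2, 10, 1)]
```

Note how product 1 appears twice in the output: once per employee who sold it.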
DISTINCT
The DISTINCT keyword is used to return unique values from a column or a combination of columns.
It is especially useful when a dataset contains duplicate values and you want to understand the unique categories or combinations present in the data.
Syntax
SELECT DISTINCT column_name
FROM table_name;
Example | Finding Unique Values
Suppose you want to see all unique product categories in the products table.
SELECT DISTINCT category
FROM products;This query returns one row per unique category, even if multiple products belong to the same category.
DISTINCT with Multiple Columns
DISTINCT can also be applied to multiple columns to find unique combinations.
For example, to find unique combinations of product category and price:
SELECT DISTINCT
category,
price
FROM products;This returns each unique (category, price) pair found in the table.
DISTINCT with COUNT
DISTINCT can also be used inside COUNT(). COUNT(category) counts all non-null values, while COUNT(DISTINCT category) counts only the unique ones:
SELECT
COUNT(category),
COUNT(DISTINCT category)
FROM products;
DISTINCT vs GROUP BY
In many cases, DISTINCT and GROUP BY can produce the same result when no aggregate functions are used.
Using DISTINCT:
SELECT DISTINCT
category,
price
FROM products;
Using GROUP BY:
SELECT
category,
price
FROM products
GROUP BY category, price;
Both queries return the unique combinations of category and price.
When to Use DISTINCT
Use DISTINCT when:
- you only need unique values
- no aggregation is required
- you want simpler, more readable SQL
Use GROUP BY when:
- you need aggregate functions
- you want more control over grouping logic
GROUP BY vs DISTINCT
Which command do you think is faster: GROUP BY or DISTINCT?
Use EXPLAIN to generate query plans for both queries and compare their estimated costs to see which one runs more efficiently in PostgreSQL.
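If you don’t have PostgreSQL at hand, you can try a similar comparison in SQLite through Python’s sqlite3 module, whose EXPLAIN QUERY PLAN plays the role of EXPLAIN; the products rows below are invented:

```python
import sqlite3

# Invented sample data for the comparison.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (category TEXT, price REAL)")
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [("toys", 10.0), ("toys", 10.0), ("books", 5.0)])

# Both queries return the same unique (category, price) pairs.
rows_distinct = conn.execute(
    "SELECT DISTINCT category, price FROM products ORDER BY 1, 2").fetchall()
rows_group_by = conn.execute(
    "SELECT category, price FROM products GROUP BY category, price ORDER BY 1, 2").fetchall()
print(rows_distinct)  # [('books', 5.0), ('toys', 10.0)]

# Compare the plans the engine chooses for each form.
for sql in ("SELECT DISTINCT category, price FROM products",
            "SELECT category, price FROM products GROUP BY category, price"):
    plan = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    print([row[3] for row in plan])
```

On many SQLite versions the two plans are nearly identical, which mirrors how PostgreSQL often optimizes DISTINCT and a plain GROUP BY the same way.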
HAVING
Why HAVING Exists
The HAVING clause is designed to filter aggregated results.
In analytical queries, data is often grouped and summarized before it becomes meaningful.
At that stage, row-level filtering is no longer sufficient.
HAVING fills this gap.
- WHERE \(\rightarrow\) filters rows before aggregation
- HAVING \(\rightarrow\) filters groups after aggregation
Understanding this distinction is essential for correct and efficient SQL.
WHERE vs HAVING
Think of SQL execution as a pipeline.
- raw rows are first filtered
- rows are then grouped
- aggregates are calculated
- groups are filtered again if needed
This leads to a clear separation of responsibilities:
- WHERE \(\rightarrow\) row-level filtering
- GROUP BY \(\rightarrow\) aggregation
- HAVING \(\rightarrow\) group-level filtering
Logical Query Execution Order
Although SQL is written top-down, it is executed in a different logical order.
At a high level:
FROM \(\rightarrow\) WHERE \(\rightarrow\) GROUP BY \(\rightarrow\) HAVING \(\rightarrow\) SELECT \(\rightarrow\) ORDER BY \(\rightarrow\) LIMIT
This order explains why aggregate functions cannot appear in WHERE, but are allowed in HAVING.
Basic HAVING Syntax
A typical query using HAVING looks like this:
SELECT
group_column,
AGGREGATE_FUNCTION(column)
FROM table_name
GROUP BY group_column
HAVING AGGREGATE_FUNCTION(column) condition;
Important points:
- HAVING is evaluated after aggregation
- it almost always appears together with GROUP BY
- it works on aggregated values, not raw rows
HAVING Example | Filtering by Total Revenue
Suppose you want to identify products that generated more than 10,000 in total revenue.
SELECT
product_id,
SUM(total_sales) AS total_revenue
FROM sales
GROUP BY product_id
HAVING SUM(total_sales) > 10000;
This query:
- groups sales by product
- calculates total revenue per product
- keeps only products above the revenue threshold
HAVING vs WHERE — Common Mistake
Using WHERE with aggregate functions is not allowed.
Incorrect example:
SELECT
product_id,
SUM(total_sales)
FROM sales
WHERE SUM(total_sales) > 10000
GROUP BY product_id;
This fails because SUM(total_sales) does not exist at the WHERE stage of execution.
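The same failure can be reproduced outside PostgreSQL. The sketch below uses SQLite via Python’s sqlite3 module (with an invented sales table); SQLite likewise rejects an aggregate in WHERE but accepts it in HAVING:

```python
import sqlite3

# Invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product_id INTEGER, total_sales REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(1, 6000.0), (1, 7000.0), (2, 500.0)])

# An aggregate in WHERE is rejected before grouping ever happens.
try:
    conn.execute("SELECT product_id FROM sales "
                 "WHERE SUM(total_sales) > 10000 GROUP BY product_id")
    where_failed = False
except sqlite3.OperationalError as exc:
    where_failed = True
    print(exc)  # a "misuse of aggregate" error (wording varies by version)

# The same condition is valid in HAVING, after aggregation.
rows = conn.execute(
    "SELECT product_id, SUM(total_sales) AS total_revenue "
    "FROM sales GROUP BY product_id HAVING SUM(total_sales) > 10000").fetchall()
print(rows)  # [(1, 13000.0)]
```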
Combining WHERE and HAVING
In real-world analytical queries, WHERE and HAVING are often used together.
Their roles are complementary:
- WHERE removes unnecessary rows early
- HAVING applies business rules after aggregation
SELECT
product_id,
SUM(total_sales) AS total_revenue
FROM sales
WHERE total_sales > 0
GROUP BY product_id
HAVING SUM(total_sales) > 10000;
This approach is usually more efficient than relying on HAVING alone.
HAVING with COUNT
HAVING is frequently used with COUNT().
Example: find products with at least 50 transactions.
SELECT
product_id,
COUNT(transaction_id) AS transaction_count
FROM sales
GROUP BY product_id
HAVING COUNT(transaction_id) >= 50;
Here:
- transactions are grouped by product
- only frequently sold products are retained
HAVING with Multiple Conditions
Multiple aggregate conditions can be combined using logical operators.
SELECT
product_id,
COUNT(transaction_id) AS transaction_count,
SUM(total_sales) AS total_revenue
FROM sales
GROUP BY product_id
HAVING
COUNT(transaction_id) >= 50
AND SUM(total_sales) > 10000;
This allows you to enforce compound business rules at the group level.
When to Use HAVING
Use HAVING when:
- filtering aggregated results
- applying thresholds to groups
- working with SUM, COUNT, AVG, MIN, or MAX
Do not use HAVING when row-level filtering with WHERE is sufficient.
Homework
Assume the following:
- Tables customers, employees, products, and sales already exist
- Their schemas and columns are already documented
- Your task is not discovery, but improvement, enforcement, and optimization
You must not recreate tables or re-document schemas.
Task 1 | Enforce Missing Business Rules with ALTER TABLE
Even with a documented schema, databases often lack enforced rules.
Apply the following constraints using ALTER TABLE:
Business rules:
- Employee emails must be unique
- Employee phone numbers must be mandatory
- Product prices must be non-negative
- Sales totals must be non-negative
ALTER TABLE employees
ADD CONSTRAINT uq_employees_email UNIQUE (email);
ALTER TABLE employees
ALTER COLUMN phone_number SET NOT NULL;
ALTER TABLE products
ADD CONSTRAINT chk_products_price CHECK (price >= 0);
ALTER TABLE sales
ADD CONSTRAINT chk_sales_total CHECK (total_sales >= 0);
Task 2 | Add a New Analytical Attribute
The business now wants to analyze sales performance by channel.
Add a new column to the sales table:
- Column name: sales_channel
- Allowed values: 'online', 'store'
ALTER TABLE sales
ADD COLUMN sales_channel TEXT;
Add a constraint to enforce valid values.
ALTER TABLE sales
ADD CONSTRAINT chk_sales_channel
CHECK (sales_channel IN ('online', 'store'));
Populate the column with sample values.
UPDATE sales
SET sales_channel = 'online'
WHERE transaction_id % 2 = 0;
UPDATE sales
SET sales_channel = 'store'
WHERE transaction_id % 2 <> 0;
Write a short explanation of what these statements do.
Task 3 | Add Indexes for Query Performance
Based on common analytical queries, create indexes on columns that (according to you) are frequently used for:
- joins
- grouping
- filtering
Create the following indexes:
CREATE INDEX idx_sales_product_id
ON sales (product_id);
CREATE INDEX idx_sales_customer_id
ON sales (customer_id);
CREATE INDEX idx_products_category
ON products (category);
Task 4 | Validate Index Usage with EXPLAIN
Run EXPLAIN on an aggregation query that is commonly used in reporting.
EXPLAIN
SELECT
product_id,
SUM(total_sales) AS total_revenue
FROM sales
GROUP BY product_id;
Write a short interpretation:
- Is a sequential scan used?
- Does PostgreSQL leverage the index?
- Why might the planner choose this plan?
Task 5 | Reduce Query Cost by Refining SELECT
Rewrite a reporting query to avoid unnecessary data retrieval.
Original query:
SELECT *
FROM sales;
Refined query:
SELECT
transaction_id,
product_id,
total_sales
FROM sales;
Explain in 2–3 sentences:
- why this reduces cost
- when SELECT * might still be acceptable
Task 6 | ORDER BY and LIMIT for Business Questions
Marketing wants to identify the top 5 products by total revenue.
Write a query that:
- aggregates sales
- sorts by revenue
- limits the output
SELECT
product_id,
SUM(total_sales) AS total_revenue
FROM sales
GROUP BY product_id
ORDER BY total_revenue DESC
LIMIT 5;
Run EXPLAIN and comment on:
- sorting cost
- whether indexes help in this case
Task 7 | DISTINCT vs GROUP BY (Efficiency Comparison)
Retrieve unique combinations of category and price using both approaches.
Using DISTINCT:
EXPLAIN
SELECT DISTINCT
category,
price
FROM products;
Using GROUP BY:
EXPLAIN
SELECT
category,
price
FROM products
GROUP BY category, price;
Answer briefly:
- Are the query plans similar?
- Which has lower estimated cost?
- Why might PostgreSQL optimize them the same way?
Task 8 | Constraint Enforcement Test
Attempt to violate at least two constraints you added earlier and observe the errors.
Examples:
UPDATE products
SET price = -5
WHERE product_id = 101;
INSERT INTO employees (employee_id, email, phone_number)
VALUES (999, 'anna@example.com', NULL);
Explain:
- which constraint was triggered
- why this protects data quality
Task 9 | Reflection (Short Answer)
Answer briefly (3–5 sentences total):
- Which constraints provide the highest business value?
- Which index would you prioritize in a production environment?
- What signals tell you a query needs optimization?
Submission Rules
- No table creation
- No schema documentation
- Only ALTER, CREATE INDEX, UPDATE, EXPLAIN, SELECT, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT
- Include SQL and short written interpretations
- Remember you need to put the code in a .sql file
- Add to staging with git add .
- Commit the changes with git commit -m "meaningful message"
- Push to GitHub with git push