Session 03: DA with SQL | Data Types & Constraints
Behind the Scenes
Before diving into more SQL syntax, let’s pause for a moment to consider what happens behind the scenes when a SQL query is executed.
To understand why this matters, think about an e-commerce company that sells products online (similar to our case study). If management wants to reduce operational costs or improve customer experience, they must analyze and optimize many background processes—inventory management, order fulfillment, payment processing, and delivery.
The same idea applies when running SQL queries.
What you write is only part of the story; how the database executes it matters just as much.
SQL as a Declarative Language
The first key concept to understand is that SQL is a declarative programming language.
This means:
- You tell the database what result you want
- You do not tell it how to get that result
When you write a SELECT statement, you describe the desired output, and the database engine (RDBMS) decides the most efficient way to retrieve the data.
Consider the following declarative query:
SELECT product_name
FROM products;
This statement tells the database to:
- return all values from the product_name column
- read the data from the products table
What it does not specify:
- which indexes to use
- how the data should be scanned
- whether to read data sequentially or via an index
All of these decisions are handled internally by the database engine.
Query Optimization Happens Automatically
When a query is submitted:
- The database parses the SQL statement
- It checks syntax and permissions
- The query planner evaluates multiple execution strategies
- The optimizer selects the most efficient execution plan
- The query is executed and results are returned
Because of this process, part of writing efficient SQL queries is trusting the database to do its job—while still writing clear, well-structured queries that allow the optimizer to work effectively.
Why This Matters for Data Analysts
As a data analyst, this means:
- You focus on business logic and correctness
- The database focuses on performance and execution
- Clean SQL enables better optimization
- Proper indexes and constraints support faster queries
Understanding what happens behind the scenes will help you:
- interpret slow queries
- write more efficient SQL
- collaborate more effectively with data engineers and database administrators
This foundation becomes especially important as datasets grow larger and queries become more complex.
Query plans can look confusing at first, and you don’t need to understand every detail. In fact, the only element you need to take note of is the cost. This cost doesn’t refer to the actual cost of running the query; rather, it’s an arbitrary unit used to indicate the estimated run time of the query. A plan reports two values: the cost of returning the first row (for example, 0) and the cost of returning all the rows (for example, 64). Since the cost unit doesn’t refer to a specific second or minute, all this tells us is that a query with a cost of 64 will take longer than a query with a cost of 30.
Query Plan
As you become more familiar with SQL, you will start combining multiple SQL commands into scripts.
When working with large databases—often containing millions or billions of records—it becomes essential to consider the time and cost of running those scripts.
For example, imagine working with a large transactional system that stores sales and product data.
Running a poorly optimized query on a frequently accessed table could significantly slow down reports and increase infrastructure costs. That’s why queries that run often should be optimized before being deployed.
Faster queries mean:
- lower computational cost
- better system performance
- quicker insights for decision-makers
What Is a Query Plan?
A query plan shows how the database intends to execute a SQL query.
Most relational database management systems (RDBMS), including PostgreSQL, can estimate:
- how long a query will take
- how many rows will be processed
- which operations will be performed (scans, joins, filters)
To generate a query plan, you simply add the keyword EXPLAIN before your SQL query.
Suppose you want to retrieve all records from a sales table and understand how expensive this operation might be.
EXPLAIN
SELECT *
FROM sales;
When this query is executed, PostgreSQL does not return the data itself.
Instead, it returns a query plan describing how the data would be retrieved.
A basic query plan might look like this:
"Seq Scan on sales (cost=0.00..92.00 rows=5000 width=34)"
This output tells us:
- Seq Scan: PostgreSQL is performing a sequential scan, reading every row in the table
- sales: the table being scanned
- cost: an internal estimate of startup and total execution cost
- rows: estimated number of rows returned
- width: average size of each row in bytes
A sequential scan is common when:
- the table is relatively small
- no suitable index exists
- most rows are expected to be returned
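If you want to experiment with query plans without a PostgreSQL server, SQLite (available through Python’s built-in sqlite3 module) exposes a similar idea via EXPLAIN QUERY PLAN. The sketch below uses an invented sales table to show a sequential scan turning into an index search once a suitable index exists:

```python
import sqlite3

# Toy setup — the table name and data are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (transaction_id INTEGER, product_id INTEGER, total_sales REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [(i, i % 10, 9.99) for i in range(1000)])

# Without an index, filtering must read every row (a sequential scan).
plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM sales WHERE product_id = 3").fetchall()
print(plan_before[0][3])  # e.g. 'SCAN sales'

# After adding an index, the planner can jump straight to matching rows.
conn.execute("CREATE INDEX idx_sales_product_id ON sales (product_id)")
plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM sales WHERE product_id = 3").fetchall()
print(plan_after[0][3])  # e.g. 'SEARCH sales USING INDEX idx_sales_product_id (product_id=?)'
```

The exact wording of the plan varies by SQLite version, which is why the comments say “e.g.”; the key point is the switch from a full scan to an index search.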
Try it yourself
Understanding query plans helps you:
- diagnose slow queries
- decide when indexes are needed
- write more efficient SQL
- communicate performance issues to data engineers or DBAs
As datasets grow, even simple queries can become expensive.
Learning to read and reason about query plans is a key step toward writing production-ready SQL.
Query optimization is a huge topic, most of which goes beyond the scope of the bootcamp and the role of a data analyst. Just know that writing optimized queries is a good habit to get into, especially queries that will be executed frequently.
If you’d like to read more about query optimization:
Refining Your SQL Queries
We have learned that reducing the cost of your queries is important.
But how do you actually optimize SQL queries in practice?
One of the simplest and most effective ways is to be precise about what data you request.
Before introducing additional commands, it’s worth revisiting SELECT, because it is the foundation of almost every analytical query.
SELECT
As you already know, SELECT allows you to retrieve data from a database.
A SELECT statement usually follows this structure:
SELECT column_name_1,
column_name_2
FROM table_name;
If you want to retrieve multiple columns, you list them separated by commas.
Consider the products table:
SELECT product_name,
price,
category
FROM products;
This query tells the database to:
- return the product_name
- return the price
- return the category
- read the data from the products table
Only these three columns will be included in the result set.
Why Column Selection Matters
Selecting only the columns you actually need:
- reduces the amount of data transferred
- improves query performance
- makes results easier to read and interpret
Compare the following two queries:
EXPLAIN
SELECT *
FROM products;
EXPLAIN
SELECT product_name, price
FROM products;
The second query is usually preferable when you only need product names and prices.
ORDER BY
The ORDER BY clause sorts query results in ascending or descending order.
It is commonly used when you want to view:
- the first or last records
- the highest or lowest values
- alphabetically ordered text data
By default, ORDER BY sorts results in ascending order.
SELECT column_name
FROM table_name
ORDER BY column_name;
You can explicitly control the sorting direction using:
- ASC for ascending order
- DESC for descending order
Suppose you want to view products sorted by name in descending alphabetical order (Z \(\rightarrow\) A).
SELECT
product_name,
category
FROM products
ORDER BY product_name DESC;
This query returns the product_name and category columns, sorted from Z to A, because product_name is a text (character) column.
If you sort a numeric column in descending order, results are ordered from largest to smallest.
SELECT
product_name,
price
FROM products
ORDER BY price DESC;
This query returns the most expensive products first.
You can sort by more than one column. The database applies sorting from left to right.
SELECT
product_name,
category,
price
FROM products
ORDER BY category ASC, price DESC;
This means:
- Products are grouped by category (A \(\rightarrow\) Z)
- Within each category, products are sorted by price (high \(\rightarrow\) low)
To sum up:
- ORDER BY is applied after SELECT and FROM
- Default sort order is ascending
- Use DESC for reverse ordering
- Text columns are sorted alphabetically
- Numeric columns are sorted by value
- Multiple columns can be combined for fine-grained control
In the ORDER BY clause, instead of writing column names, we can simply provide their index (their position in the SELECT list).
For example:
SELECT
product_name,
category,
price
FROM products
ORDER BY 2 ASC, 3 DESC;
LIMIT
The LIMIT clause restricts the number of rows returned by a query.
It is useful when:
- you only need a small sample of the data
- you want to preview results
- you want to reduce query cost and execution time
Basic Syntax:
SELECT column_name
FROM table_name
LIMIT number_of_rows;
The value provided to LIMIT must always be a number.
Example:
Suppose you want to retrieve product names and prices, sorted by price from highest to lowest, but you only care about the top 10 most expensive products.
SELECT
product_name,
price
FROM products
ORDER BY price DESC
LIMIT 10;
This query:
- sorts products by price in descending order
- returns only the first 10 rows
Using LIMIT in this way saves time and computational resources, especially when working with large tables.
- LIMIT is always written at the end of a query
- it must be placed after ORDER BY (if present)
- the value must be numeric
If LIMIT is placed elsewhere in the query, PostgreSQL will return an error.
Not all SQL databases use LIMIT.
- PostgreSQL, MySQL, SQLite \(\rightarrow\) LIMIT
- SQL Server \(\rightarrow\) TOP
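To see ORDER BY and LIMIT work end to end, here is a minimal, runnable sketch using SQLite through Python’s sqlite3 module (SQLite shares PostgreSQL’s LIMIT syntax); the product names and prices are invented:

```python
import sqlite3

# Invented products and prices for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_name TEXT, price REAL)")
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [("laptop", 1200.0), ("mouse", 25.0), ("monitor", 300.0),
                  ("keyboard", 45.0), ("webcam", 80.0)])

# Sort by price (highest first) and keep only the top 2 rows.
top_two = conn.execute(
    "SELECT product_name, price FROM products ORDER BY price DESC LIMIT 2").fetchall()
print(top_two)  # [('laptop', 1200.0), ('monitor', 300.0)]
```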
GROUP BY
The GROUP BY clause is very popular among data analysts :)
The GROUP BY clause allows you to group rows that share the same values in one or more columns.
It is commonly used together with aggregate functions to summarize data.
Similar to Excel, SQL provides several aggregate functions that allow you to perform calculations across multiple rows of data.
Below is a list of commonly used aggregate functions.
Note that SUM and AVG can only be applied to numeric columns.
- COUNT() counts the number of rows
- MAX() finds the maximum value in a column
- MIN() finds the minimum value in a column
- SUM() adds up all numeric values in a column
- AVG() calculates the average of numeric values in a column
We are going to learn more functions later.
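As a quick, runnable illustration of these five functions, the sketch below uses SQLite via Python’s sqlite3 module with a made-up products table:

```python
import sqlite3

# Invented sample table for the illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_name TEXT, price REAL)")
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [("a", 10.0), ("b", 20.0), ("c", 30.0)])

# All five aggregates collapse the three rows into a single summary row.
row = conn.execute(
    "SELECT COUNT(*), MAX(price), MIN(price), SUM(price), AVG(price) FROM products"
).fetchone()
print(row)  # (3, 30.0, 10.0, 60.0, 20.0)
```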
Example | Grouping Sales by Product
Suppose you want to calculate total sales per product.
SELECT
product_id,
SUM(total_sales) AS total_revenue
FROM sales
GROUP BY product_id;
This query:
- groups rows by product_id
- calculates total revenue per product
- returns one row per product
Key Rules for GROUP BY
- Every column in SELECT must either:
  - appear in the GROUP BY clause, or
  - be wrapped in an aggregate function
- GROUP BY is evaluated after WHERE and before ORDER BY
Example | Grouping with Sorting and Limiting
SELECT
product_id,
SUM(total_sales) AS total_revenue
FROM sales
GROUP BY product_id
ORDER BY total_revenue DESC
LIMIT 5;
This query returns the top 5 products by total revenue.
- LIMIT controls how many rows are returned
- GROUP BY controls how rows are aggregated
- Both are essential tools for efficient, scalable data analysis
- They are often used together in real-world analytical queries
Example | Discovering Distinct Values
GROUP BY can also be used to identify distinct values in a column.
This is useful when you want to understand how data is organized without seeing every individual row.
To see all product categories in the database, you could start with:
SELECT
category
FROM products;
This query returns the category for every product, which may include many duplicates. If you only want to know which categories exist, GROUP BY is the right tool.
SELECT
category,
COUNT(*)
FROM products
GROUP BY category;
This query:
- groups rows by category
- returns one row per unique category
- removes duplicate category values from the output
The result is a concise summary of the different product categories stored in the database.
Using GROUP BY in this way helps you:
- explore the structure of a dataset
- validate categorical columns
- prepare data for aggregation and reporting
In practice, GROUP BY becomes even more powerful when combined with aggregate functions, which you’ll explore next.
Example | Sales per Product
Aggregate functions are most powerful when combined with GROUP BY, which allows you to summarize data across categories. Suppose you are working with a sales department and want to understand how many sales transactions exist per product.
To begin, recall that the sales table may contain many rows per product.
SELECT
product_id,
COUNT(transaction_id) AS number_of_transactions
FROM sales
GROUP BY product_id;
This query:
- groups rows by product_id
- counts how many transactions exist for each product
- returns one row per product
Example | Total Revenue per Product
You can also summarize numeric columns such as total revenue.
SELECT
product_id,
SUM(total_sales) AS total_revenue
FROM sales
GROUP BY product_id;
Here:
- SUM(total_sales) adds up all sales values per product
- the result shows total revenue generated by each product
Example | Average Price by Category
Aggregate functions are often used for higher-level analysis.
SELECT
category,
AVG(price) AS average_price
FROM products
GROUP BY category;
This query:
- groups products by category
- calculates the average price within each category
GROUP BY with Multiple Columns
You can also group data by multiple columns.
To do this, simply list all grouping columns after GROUP BY, separated by commas.
This allows you to create more granular summaries by combining multiple dimensions.
Example | Grouping by Multiple Attributes
Suppose you want to understand how many sales transactions exist for each combination of product_id and employee_id (later we will join in the employee name).
SELECT
product_id,
employee_id,
COUNT(transaction_id) AS transaction_count
FROM sales
GROUP BY product_id, employee_id
ORDER BY COUNT(transaction_id) DESC;
This query:
- groups records by both product_id and employee_id
- counts the number of transactions in each combination
- returns one row per unique (product_id, employee_id) pair
The output represents:
- one row per product and employee combination
- the total number of transactions for that specific combination
Grouping by multiple columns helps you:
- analyze performance across multiple dimensions
- break down metrics by category, region, channel, or time
- build detailed summaries for dashboards and reports
You will get more practice using multi-column GROUP BY as you move into data summarization, aggregation, and data cleaning tasks later in the course.
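The multi-column grouping above can be sketched in runnable form with SQLite via Python’s sqlite3 module; the transaction rows below are invented:

```python
import sqlite3

# Invented transactions: (transaction_id, product_id, employee_id).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (transaction_id INTEGER, product_id INTEGER, employee_id INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [(1, 1, 10), (2, 1, 10), (3, 1, 20), (4, 2, 10)])

# One output row per unique (product_id, employee_id) pair.
rows = conn.execute(
    "SELECT product_id, employee_id, COUNT(transaction_id) AS transaction_count "
    "FROM sales GROUP BY product_id, employee_id "
    "ORDER BY product_id, employee_id").fetchall()
print(rows)  # [(1, 10, 2), (1, 20, 1), (2, 10, 1)]
```

Note how product 1 appears twice in the output: once per employee who sold it.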
DISTINCT
The DISTINCT keyword is used to return unique values from a column or a combination of columns.
It is especially useful when a dataset contains duplicate values and you want to understand the unique categories or combinations present in the data.
Syntax
SELECT DISTINCT column_name
FROM table_name;
Example | Finding Unique Values
Suppose you want to see all unique product categories in the products table.
SELECT DISTINCT category
FROM products;This query returns one row per unique category, even if multiple products belong to the same category.
DISTINCT with Multiple Columns
DISTINCT can also be applied to multiple columns to find unique combinations.
For example, to find unique combinations of product category and price:
SELECT DISTINCT
category,
price
FROM products;This returns each unique (category, price) pair found in the table.
DISTINCT with COUNT
DISTINCT can also be used inside COUNT(). COUNT(category) counts all non-null values, while COUNT(DISTINCT category) counts only the unique ones:
SELECT
COUNT(category),
COUNT(DISTINCT category)
FROM products;
DISTINCT vs GROUP BY
In many cases, DISTINCT and GROUP BY can produce the same result when no aggregate functions are used.
Using DISTINCT:
SELECT DISTINCT
category,
price
FROM products;
Using GROUP BY:
SELECT
category,
price
FROM products
GROUP BY category, price;
Both queries return the unique combinations of category and price.
When to Use DISTINCT
Use DISTINCT when:
- you only need unique values
- no aggregation is required
- you want simpler, more readable SQL
Use GROUP BY when:
- you need aggregate functions
- you want more control over grouping logic
GROUP BY vs DISTINCT
Which command do you think is faster: GROUP BY or DISTINCT?
Use EXPLAIN to generate query plans for both queries and compare their estimated costs to see which one runs more efficiently in PostgreSQL.
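If you don’t have PostgreSQL at hand, you can try a similar comparison in SQLite through Python’s sqlite3 module, whose EXPLAIN QUERY PLAN plays the role of EXPLAIN; the products rows below are invented:

```python
import sqlite3

# Invented sample data for the comparison.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (category TEXT, price REAL)")
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [("toys", 10.0), ("toys", 10.0), ("books", 5.0)])

# Both queries return the same unique (category, price) pairs.
rows_distinct = conn.execute(
    "SELECT DISTINCT category, price FROM products ORDER BY 1, 2").fetchall()
rows_group_by = conn.execute(
    "SELECT category, price FROM products GROUP BY category, price ORDER BY 1, 2").fetchall()
print(rows_distinct)  # [('books', 5.0), ('toys', 10.0)]

# Compare the plans the engine chooses for each form.
for sql in ("SELECT DISTINCT category, price FROM products",
            "SELECT category, price FROM products GROUP BY category, price"):
    plan = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    print([row[3] for row in plan])
```

On many SQLite versions the two plans are nearly identical, which mirrors how PostgreSQL often optimizes DISTINCT and a plain GROUP BY the same way.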
HAVING
Why HAVING Exists
The HAVING clause is designed to filter aggregated results.
In analytical queries, data is often grouped and summarized before it becomes meaningful.
At that stage, row-level filtering is no longer sufficient.
HAVING fills this gap.
- WHERE \(\rightarrow\) filters rows before aggregation
- HAVING \(\rightarrow\) filters groups after aggregation
Understanding this distinction is essential for correct and efficient SQL.
WHERE vs HAVING
Think of SQL execution as a pipeline.
- raw rows are first filtered
- rows are then grouped
- aggregates are calculated
- groups are filtered again if needed
This leads to a clear separation of responsibilities:
- WHERE \(\rightarrow\) row-level filtering
- GROUP BY \(\rightarrow\) aggregation
- HAVING \(\rightarrow\) group-level filtering
Logical Query Execution Order
Although SQL is written top-down, it is executed in a different logical order.
At a high level:
FROM \(\rightarrow\) WHERE \(\rightarrow\) GROUP BY \(\rightarrow\) HAVING \(\rightarrow\) SELECT \(\rightarrow\) ORDER BY \(\rightarrow\) LIMIT
This order explains why aggregate functions cannot appear in WHERE, but are allowed in HAVING.
Basic HAVING Syntax
A typical query using HAVING looks like this:
SELECT
group_column,
AGGREGATE_FUNCTION(column)
FROM table_name
GROUP BY group_column
HAVING AGGREGATE_FUNCTION(column) condition;
Important points:
- HAVING is evaluated after aggregation
- it almost always appears together with GROUP BY
- it works on aggregated values, not raw rows
HAVING Example | Filtering by Total Revenue
Suppose you want to identify products that generated more than 10,000 in total revenue.
SELECT
product_id,
SUM(total_sales) AS total_revenue
FROM sales
GROUP BY product_id
HAVING SUM(total_sales) > 10000;
This query:
- groups sales by product
- calculates total revenue per product
- keeps only products above the revenue threshold
HAVING vs WHERE — Common Mistake
Using WHERE with aggregate functions is not allowed.
Incorrect example:
SELECT
product_id,
SUM(total_sales)
FROM sales
WHERE SUM(total_sales) > 10000
GROUP BY product_id;
This fails because SUM(total_sales) does not exist at the WHERE stage of execution.
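The same failure can be reproduced outside PostgreSQL. The sketch below uses SQLite via Python’s sqlite3 module (with an invented sales table); SQLite likewise rejects an aggregate in WHERE but accepts it in HAVING:

```python
import sqlite3

# Invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product_id INTEGER, total_sales REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(1, 6000.0), (1, 7000.0), (2, 500.0)])

# An aggregate in WHERE is rejected before grouping ever happens.
try:
    conn.execute("SELECT product_id FROM sales "
                 "WHERE SUM(total_sales) > 10000 GROUP BY product_id")
    where_failed = False
except sqlite3.OperationalError as exc:
    where_failed = True
    print(exc)  # a "misuse of aggregate" error (wording varies by version)

# The same condition is valid in HAVING, after aggregation.
rows = conn.execute(
    "SELECT product_id, SUM(total_sales) AS total_revenue "
    "FROM sales GROUP BY product_id HAVING SUM(total_sales) > 10000").fetchall()
print(rows)  # [(1, 13000.0)]
```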
Combining WHERE and HAVING
In real-world analytical queries, WHERE and HAVING are often used together.
Their roles are complementary:
- WHERE removes unnecessary rows early
- HAVING applies business rules after aggregation
SELECT
product_id,
SUM(total_sales) AS total_revenue
FROM sales
WHERE total_sales > 0
GROUP BY product_id
HAVING SUM(total_sales) > 10000;
This approach is usually more efficient than relying on HAVING alone.
HAVING with COUNT
HAVING is frequently used with COUNT().
Example: find products with at least 50 transactions.
SELECT
product_id,
COUNT(transaction_id) AS transaction_count
FROM sales
GROUP BY product_id
HAVING COUNT(transaction_id) >= 50;
Here:
- transactions are grouped by product
- only frequently sold products are retained
HAVING with Multiple Conditions
Multiple aggregate conditions can be combined using logical operators.
SELECT
product_id,
COUNT(transaction_id) AS transaction_count,
SUM(total_sales) AS total_revenue
FROM sales
GROUP BY product_id
HAVING
COUNT(transaction_id) >= 50
AND SUM(total_sales) > 10000;
This allows you to enforce compound business rules at the group level.
When to Use HAVING
Use HAVING when:
- filtering aggregated results
- applying thresholds to groups
- working with SUM, COUNT, AVG, MIN, or MAX
Do not use HAVING when row-level filtering with WHERE is sufficient.
Homework
Assume the following:
- Tables customers, employees, products, and sales already exist
- Their schemas and columns are already documented
- Your task is not discovery, but improvement, enforcement, and optimization
You must not recreate tables or re-document schemas.
Task 1 | Enforce Missing Business Rules with ALTER TABLE
Even with a documented schema, databases often lack enforced rules.
Apply the following constraints using ALTER TABLE:
Business rules:
- Employee emails must be unique
- Employee phone numbers must be mandatory
- Product prices must be non-negative
- Sales totals must be non-negative
ALTER TABLE employees
ADD CONSTRAINT uq_employees_email UNIQUE (email);
ALTER TABLE employees
ALTER COLUMN phone_number SET NOT NULL;
ALTER TABLE products
ADD CONSTRAINT chk_products_price CHECK (price >= 0);
ALTER TABLE sales
ADD CONSTRAINT chk_sales_total CHECK (total_sales >= 0);
Task 2 | Add a New Analytical Attribute
The business now wants to analyze sales performance by channel.
Add a new column to the sales table:
- Column name: sales_channel
- Allowed values: 'online', 'store'
ALTER TABLE sales
ADD COLUMN sales_channel TEXT;
Add a constraint to enforce valid values.
ALTER TABLE sales
ADD CONSTRAINT chk_sales_channel
CHECK (sales_channel IN ('online', 'store'));
Populate the column with sample values.
UPDATE sales
SET sales_channel = 'online'
WHERE transaction_id % 2 = 0;
UPDATE sales
SET sales_channel = 'store'
WHERE transaction_id % 2 <> 0;
Write a short explanation of what these statements do.
Task 3 | Add Indexes for Query Performance
Based on common analytical queries, create indexes on columns that (according to you) are frequently used for:
- joins
- grouping
- filtering
Create the following indexes:
CREATE INDEX idx_sales_product_id
ON sales (product_id);
CREATE INDEX idx_sales_customer_id
ON sales (customer_id);
CREATE INDEX idx_products_category
ON products (category);
Task 4 | Validate Index Usage with EXPLAIN
Run EXPLAIN on an aggregation query that is commonly used in reporting.
EXPLAIN
SELECT
product_id,
SUM(total_sales) AS total_revenue
FROM sales
GROUP BY product_id;
Write a short interpretation:
- Is a sequential scan used?
- Does PostgreSQL leverage the index?
- Why might the planner choose this plan?
Task 5 | Reduce Query Cost by Refining SELECT
Rewrite a reporting query to avoid unnecessary data retrieval.
Original query:
SELECT *
FROM sales;
Refined query:
SELECT
transaction_id,
product_id,
total_sales
FROM sales;
Explain in 2–3 sentences:
- why this reduces cost
- when SELECT * might still be acceptable
Task 6 | ORDER BY and LIMIT for Business Questions
Marketing wants to identify the top 5 products by total revenue.
Write a query that:
- aggregates sales
- sorts by revenue
- limits the output
SELECT
product_id,
SUM(total_sales) AS total_revenue
FROM sales
GROUP BY product_id
ORDER BY total_revenue DESC
LIMIT 5;
Run EXPLAIN and comment on:
- sorting cost
- whether indexes help in this case
Task 7 | DISTINCT vs GROUP BY (Efficiency Comparison)
Retrieve unique combinations of category and price using both approaches.
Using DISTINCT:
EXPLAIN
SELECT DISTINCT
category,
price
FROM products;
Using GROUP BY:
EXPLAIN
SELECT
category,
price
FROM products
GROUP BY category, price;
Answer briefly:
- Are the query plans similar?
- Which has lower estimated cost?
- Why might PostgreSQL optimize them the same way?
Task 8 | Constraint Enforcement Test
Attempt to violate at least two constraints you added earlier and observe the errors.
Examples:
UPDATE products
SET price = -5
WHERE product_id = 101;
INSERT INTO employees (employee_id, email, phone_number)
VALUES (999, 'anna@example.com', NULL);
Explain:
- which constraint was triggered
- why this protects data quality
Task 9 | Reflection (Short Answer)
Answer briefly (3–5 sentences total):
- Which constraints provide the highest business value?
- Which index would you prioritize in a production environment?
- What signals tell you a query needs optimization?
Submission Rules
- No table creation
- No schema documentation
- Only ALTER, CREATE INDEX, UPDATE, EXPLAIN, SELECT, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT
- Include SQL and short written interpretations
- Remember you need to put the code in a .sql file
- Add to staging with git add .
- Commit the changes with git commit -m "meaningful message"
- Push to GitHub with git push