Data Analytics Bootcamp
Session 02: Intro to PostgreSQL

Tags: SQL · Data Types · Schemas · Keys · Indexes · Syntax

Sales Analytics Database

Imagine you’ve just joined a company as a data analyst.
The company sells products through an online channel and stores its operational data in a PostgreSQL relational database.

The database captures information about:

  • customers who place orders
  • products that are sold
  • employees involved in the sales process
  • orders placed over time
  • individual sales transactions

Your role as an analyst is to query this data to answer business questions related to:

  • revenue and sales performance
  • product popularity
  • customer behavior
  • employee contribution
  • time-based trends

Learning Goals

In the previous session, you learned about different types of databases and the database management systems used to define, query, and manage them.

Without databases, you wouldn’t be able to retrieve, update, or delete data.
Before analysis can begin, however, data must first be stored, structured, and related correctly.

In this session, we will learn about the core building blocks of a relational database:

  • tables, rows, and columns
  • keys and relationships
  • indexes and performance trade-offs
  • data types in PostgreSQL
  • database schemas and analytical modeling concepts

By the end of this session, you should be able to read and understand a relational schema and write more efficient SQL queries.

Tip

Check out how to use pgAdmin 4 for better navigation.


Data Storage Structures

A relational database consists of one or more tables made up of rows and columns, and these tables are linked by relationships.

In this course, we will work with a sales analytics database that includes tables such as:

  • sales
  • orders
  • products
  • customers
  • employees

Each table stores information about a specific business entity, and relationships between tables allow us to combine this information during analysis.

Each column in a table represents a specific attribute of a transaction, such as total_sales, quantity, or product_id. Each row (record) represents a single transaction.

Tip

Another name for a record is a tuple.

We will revisit tuples when working with Python and relational data.

Tables are such a common structure that you’ve likely interacted with them without realizing it.
For example, when you place an order on an e-commerce website like Amazon:

  • Each row represents a different order or transaction
  • Columns represent attributes such as:
    • customer name
    • address
    • product ordered
    • price
    • order date

To illustrate, here’s an example order report.


Keys and Indexes

To store data efficiently and enable relationships between tables, relational databases rely on keys and indexes.

Keys

A key is a column (or set of columns) that uniquely identifies a row in a table.
Keys are essential for locating records and defining relationships between tables.

If you refer back to the sales table, you’ll notice a key icon next to the transaction_id column.
This indicates that transaction_id is a key.

In our database, each row in the sales table represents a single transaction line, uniquely identified by transaction_id.

Primary Key

A primary key uniquely identifies each record in a table.
Each table can have only one primary key, and it must:

  1. Be unique
  2. Never be NULL
  3. Remain stable when referenced by other tables

In the sales table, transaction_id is the primary key.

transaction_id  order_id  product_id  customer_id  total_sales
1001            501       12          3001         49.99
1002            501       18          3001         19.99
1003            502       12          3005         49.99
1004            503       25          3008         89.99

Candidate Key

A candidate key is any column (or set of columns) that can uniquely identify a row.

For example, in the customers table, both customer_id and email can uniquely identify a row, so both columns are candidate keys.

From the set of candidate keys, one is chosen as the primary key.

In this schema, customer_id is selected as the primary key, while email remains an alternative identifier.

customer_id  customer_name  email                    city
3001         Alice Johnson  alice.johnson@email.com  Berlin
3002         Mark Thompson  mark.thompson@email.com  Paris
3003         Elena Petrova  elena.p@email.com        Madrid
3004         David Chen     david.chen@email.com     London

Composite Key

A composite key is formed by combining two or more columns when no single column is sufficient to uniquely identify a row.

For example, (order_id, product_id) could uniquely identify rows in some transactional systems.

order_id  product_id  product_name         quantity
501       12          Wireless Mouse       1
501       18          USB-C Cable          2
502       12          Wireless Mouse       1
503       25          Mechanical Keyboard  1

In this table:

  • order_id alone is not unique
  • product_id alone is not unique
  • The combination (order_id, product_id) uniquely identifies each row

This combination forms a composite key.
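In SQL, a composite key is declared with a table-level PRIMARY KEY clause listing both columns. A minimal sketch, using a hypothetical order_items table (the table and column layout are illustrative, not the exact DDL of this course's database):

```sql
-- No single column is unique here, so the pair
-- (order_id, product_id) acts as the composite primary key.
CREATE TABLE order_items (
  order_id     INTEGER,
  product_id   INTEGER,
  product_name TEXT,
  quantity     INTEGER,
  PRIMARY KEY (order_id, product_id)
);
```

With this definition, inserting a second row with the same (order_id, product_id) pair is rejected, while repeating either value on its own is allowed.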

Surrogate Key

A surrogate key is a system-generated identifier with no business meaning.

The transaction_id column is a surrogate key; its only purpose is to uniquely identify each transaction.

transaction_id  order_id  product_id  quantity  total_sales
1001            501       12          1         49.99
1002            501       18          2         19.99
1003            502       12          1         49.99
1004            503       25          1         89.99

Surrogate keys are often used even when natural or composite keys exist.

They are preferred in analytical systems because they:

  • simplify joins
  • reduce index size
  • improve query readability
  • remain stable even if business rules change

Foreign Key

A foreign key is a column (or set of columns) that references a primary key in another table.

For example, the product_id column in the sales table references the product_id primary key in the products table.
Foreign keys establish relationships between tables and allow us to join data across entities.

A foreign key thus enforces referential integrity between the two tables, which is crucial to:

  • prevent invalid data (e.g., sales for non-existent products)
  • define how tables are related
  • enable joins between tables
  • preserve data consistency

Important

In analytical databases, foreign keys are sometimes not physically enforced for performance reasons, but they are always enforced logically in the data model.


Indexes

Imagine searching through a book to find all pages starting with the letter A.
Without an index, you would scan every page.
With an index (like a dictionary), you jump directly to the correct section.

Indexes work the same way in databases.

Indexes do not store new data.
They are special lookup structures that improve query performance.

Note: Full Table Scan

When a query checks every row instead of using an index, this is called a full scan.

Indexes improve read performance but come with trade-offs:

  • extra storage
  • slower inserts and updates
  • maintenance overhead

When Not to Create Indexes

  • Very small tables
  • Frequently updated columns
  • Columns with many NULL values
  • Columns with very low cardinality

Single-Column Index

A single-column index is built on one column.

Employee ID  Name     Contact Number  Age
1            Max      800692692       24
2            Jessica  800123456       35
3            Mikeal   800745547       49
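If queries frequently filter this table by name, a single-column index on that column speeds them up. A sketch (the index name idx_employees_name is an illustrative convention, not a requirement):

```sql
-- Supports lookups such as:
--   SELECT * FROM employees WHERE name = 'Max';
CREATE INDEX idx_employees_name
ON employees (name);
```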

Composite Index

Consider the same employees table, but now imagine that queries often filter by both name and age at the same time. A composite index spans multiple columns, such as (name, age), and is designed for exactly this pattern.

Composite indexes are most effective when queries filter columns in the same left-to-right order as the index definition.

employee_id  name     age  department
1            Max      24   Sales
2            Jessica  35   Marketing
3            Max      35   Finance

For an index defined as (name, age):

  • Efficient for queries filtering on name
  • Efficient for queries filtering on name and age
  • Not efficient for queries filtering on age only
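The left-to-right rule above can be sketched directly in SQL (the index name is illustrative):

```sql
CREATE INDEX idx_employees_name_age
ON employees (name, age);

-- Can use the index (leading column name is present):
SELECT * FROM employees WHERE name = 'Max';
SELECT * FROM employees WHERE name = 'Max' AND age = 35;

-- Typically cannot use it efficiently (leading column missing):
SELECT * FROM employees WHERE age = 35;
```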

Unique Index

Unlike a regular index, a unique index enforces a rule, not just performance. A unique index ensures that values are not duplicated.

customer_id  customer_name  email
3001         Alice Johnson  alice@email.com
3002         Mark Thompson  mark@email.com
3003         Elena Petrova  elena@email.com

In the above table:

  • email values are unique
  • No two customers share the same email address

Creating a unique index on email:

  • prevents duplicate emails
  • guarantees data integrity

Use a unique index when:

  • values must be unique across rows
  • the column is frequently used for lookup
  • uniqueness is part of business logic
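Such a unique index on email could be created as follows (the index name is an illustrative convention):

```sql
-- Rejects any INSERT or UPDATE that would duplicate
-- an email already present in the table.
CREATE UNIQUE INDEX idx_customers_email
ON customers (email);
```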
Important: Are Keys Also Indexes?

Keys vs Indexes

From a database engine and storage point of view, keys and indexes are closely related, but they are not the same thing.

  • A key defines a logical rule (uniqueness, relationships)
  • An index is a physical data structure stored on disk to speed up access

How PostgreSQL Handles Them

Concept            Enforces Uniqueness  Improves Query Speed  Stored as Index
Primary Key        Yes                  Yes                   Yes (unique index)
Unique Constraint  Yes                  Yes                   Yes (unique index)
Foreign Key        No                   Sometimes             No (by default)
Regular Index      No                   Yes                   Yes

Important Notes

  • Creating a primary key automatically creates a unique index
  • Creating a unique constraint automatically creates a unique index
  • Foreign keys do NOT create indexes automatically
  • Indexes exist to improve performance; keys exist to enforce rules

Data Types

Data types define what kind of data a column can store.
They ensure consistency, enable validation, and affect performance.

Consider the following example:

Employee Name  Salary ($)
Jan            100000
Alex           One hundred thousand
Kim            70k
Caution

Inconsistent data types make analysis impossible.

To calculate averages or totals, the column must enforce a numeric type.


Data Types in PostgreSQL

  1. Numeric
  2. Character
  3. Date & Time
  4. Boolean & Categorical
  5. Identifier & Auto-Increment
  6. JSON and Semi-Structured
  7. Array
  8. Special Data Types

Numeric Data Types

In PostgreSQL, numeric data types fall into three main groups:

  1. Integers: whole numbers
  2. Exact numerics: precise decimal numbers
  3. Floating-point numbers: approximate decimals

Numeric data types are used for:

  • counts
  • quantities
  • prices
  • measurements

Main Numeric Data Types

Data Type         Description                      Example Value             Memory Size  Typical Use Case
SMALLINT          Small-range whole numbers        12                        2 bytes      Status codes, flags
INTEGER / INT     Standard whole numbers           2500                      4 bytes      Quantities, counts
BIGINT            Very large whole numbers         9876543210                8 bytes      IDs, large counters
NUMERIC(p, s)     Exact precision decimal          NUMERIC(10,2) → 15432.75  Variable     Salary, revenue
DECIMAL(p, s)     Same as NUMERIC (SQL standard)   DECIMAL(8,3) → 123.456    Variable     Financial data
REAL              Approximate floating-point       3.14159                   4 bytes      Scientific values
DOUBLE PRECISION  Higher-precision floating-point  0.0000123456789           8 bytes      Statistical calculations
Note

Notice how exact numerics (NUMERIC, DECIMAL) preserve precision, while floating-point types (REAL, DOUBLE PRECISION) may introduce rounding differences.
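You can observe this difference directly in a psql session. The exact digits printed for the floating-point expression can vary with server settings, so treat the comments as indicative rather than guaranteed output:

```sql
-- Floating-point addition may carry binary rounding error;
-- NUMERIC arithmetic stays exact.
SELECT 0.1::DOUBLE PRECISION + 0.2::DOUBLE PRECISION AS float_sum,  -- approximately 0.30000000000000004
       0.1::NUMERIC          + 0.2::NUMERIC          AS exact_sum;  -- exactly 0.3
```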

Performance vs Precision Trade-off

When choosing a numeric data type, you must balance precision, performance, and storage efficiency.
The table below summarizes the most common analytical requirements and the recommended PostgreSQL data types.

Requirement                         Precision Needed        Performance Priority  Recommended Data Type
Row counts, simple counters         Exact                   Very high             INTEGER
Large identifiers (IDs)             Exact                   High                  BIGINT
Financial values (salary, revenue)  Exact                   Medium                NUMERIC(p,s)
Prices with decimals                Exact                   Medium                DECIMAL(p,s)
Ratios, averages, KPIs              Approximate acceptable  High                  DOUBLE PRECISION
Scientific measurements             Approximate acceptable  Very high             REAL / DOUBLE PRECISION

Index Size Considerations

Indexes inherit the storage and performance characteristics of the column type they are built on:

  • Index on INTEGER → small → cache-friendly
  • Index on NUMERIC → larger → slower to scan
  • Composite indexes amplify the effect of column size

Important: Best Practice

Use the smallest numeric type that safely fits your data.

  • Prefer INTEGER over BIGINT if values fit
  • Use NUMERIC only when precision matters
  • Avoid premature use of NUMERIC for IDs or counters

Character Data Types

Character data types are used to store textual information such as names, descriptions, identifiers, and free-form text.

In PostgreSQL, character data types fall into three main groups:

  1. Fixed-length characters — text with a fixed size
  2. Variable-length characters — text with flexible size limits
  3. Unbounded text — long or unpredictable text

Character data types are used for:

  • names
  • descriptions
  • emails
  • codes
  • comments

Main Character Data Types

Data Type             Description                        Example Value                          Memory Size      Typical Use Case
CHAR(n)               Fixed-length character string      CHAR(5) → 'US'                         n bytes (fixed)  Country codes, fixed formats
VARCHAR(n)            Variable-length string with limit  VARCHAR(50) → 'Karen Hovhannisyan'     Variable (≤ n)   Names, emails
TEXT                  Variable-length string, no limit   'This product has been discontinued.'  Variable         Descriptions, comments
CHARACTER VARYING(n)  SQL-standard name for VARCHAR      CHARACTER VARYING(20) → 'A123XZ'       Variable (≤ n)   Codes, identifiers
Note

VARCHAR(n) and TEXT behave almost identically in PostgreSQL.
The main difference is whether you want to enforce a maximum length.

Performance vs Flexibility Trade-off

When choosing a character data type, the trade-off is usually between data validation and flexibility, not raw performance.

Requirement           Length Control Needed   Flexibility Priority  Recommended Data Type
Fixed-format values   Yes (fixed)             Low                   CHAR(n)
Names, emails         Yes (reasonable limit)  Medium                VARCHAR(n)
Free-form text        No                      High                  TEXT
User comments, notes  No                      Very high             TEXT

Index Size Considerations

Indexes on character columns depend on string length and value distribution:

  • Index on CHAR(n) → predictable size → stable performance
  • Index on VARCHAR(n) → variable size → generally efficient
  • Index on TEXT → potentially large → slower scans for long values
  • Long strings in composite indexes increase index size significantly

Important: Best Practice

Use constraints, not oversized text types, to control data quality.

  • Prefer VARCHAR(n) when a reasonable maximum length exists
  • Use TEXT for descriptions and comments
  • Avoid CHAR(n) unless values are truly fixed-length

Date & Time Data Types

Date and time data types are used to store temporal information, such as when an event occurred, started, or ended.

They are critical for analytics involving time series, trends, seasonality, and durations.

In PostgreSQL, date and time data types fall into four main groups:

  1. Date-only values
  2. Time-only values
  3. Timestamps (date + time)
  4. Time intervals / durations

Date & time data types are used for:

  • event dates
  • timestamps
  • log records
  • durations
  • time-based analysis

Main Date & Time Data Types

Data Type                               Description                           Example Value           Memory Size  Typical Use Case
DATE                                    Calendar date (no time)               2025-03-15              4 bytes      Birth dates, order dates
TIME                                    Time of day (no date)                 14:30:00                8 bytes      Opening hours
TIME WITH TIME ZONE                     Time with time zone info              14:30:00+04             12 bytes     Cross-region schedules
TIMESTAMP                               Date and time (no timezone)           2025-03-15 14:30:00     8 bytes      Local event logs
TIMESTAMP WITH TIME ZONE (TIMESTAMPTZ)  Date and time with timezone handling  2025-03-15 10:30:00+00  8 bytes      Auditing, analytics
INTERVAL                                Time duration                         3 days 4 hours          16 bytes     Session length, SLA
Note

TIMESTAMPTZ does not store the time zone itself.
PostgreSQL converts values to UTC internally and displays them using the session time zone.
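A quick way to see this behavior in a psql session (the time zone names are examples; any valid zone works):

```sql
-- The same stored instant is displayed according to the session time zone.
SET TIME ZONE 'UTC';
SELECT TIMESTAMPTZ '2025-03-15 10:30:00+00';  -- displayed with a +00 offset

SET TIME ZONE 'Asia/Yerevan';
SELECT TIMESTAMPTZ '2025-03-15 10:30:00+00';  -- same instant, displayed with a +04 offset
```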

Accuracy vs Interpretability Trade-off

When working with time data, the trade-off is often between absolute correctness and human interpretation.

Requirement                Time Zone Awareness  Recommended Data Type
Simple dates               Not needed           DATE
Local business timestamps  Not required         TIMESTAMP
Multi-region systems       Required             TIMESTAMPTZ
Durations and differences  Not applicable       INTERVAL

Index Size Considerations

Date and time types are fixed-size, making them efficient for indexing:

  • Index on DATE → small → very fast range scans
  • Index on TIMESTAMP → small → efficient sorting
  • Index on TIMESTAMPTZ → small → timezone-safe
  • Index on INTERVAL → rarely indexed (use carefully)

Date-based indexes are commonly used with:

  • WHERE date BETWEEN …
  • GROUP BY date
  • time-window queries
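For example, a date-range aggregation over the orders table from this session's database; an index on order_date would let PostgreSQL answer the WHERE clause with a range scan instead of a full scan:

```sql
-- Orders placed per day during Q1 2025.
SELECT
  order_date::DATE AS order_day,
  COUNT(*)         AS orders_placed
FROM orders
WHERE order_date BETWEEN '2025-01-01' AND '2025-03-31'
GROUP BY order_date::DATE
ORDER BY order_day;
```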

Important: Best Practice

Always use TIMESTAMPTZ for analytics and logging systems.

  • Avoid ambiguity in multi-region environments
  • Use DATE only when time is irrelevant
  • Store durations as INTERVAL, not numeric hacks

Boolean & Categorical Data Types

Boolean and categorical data types are used to store logical values and finite categories.
They are especially important for filtering, segmentation, and business rules.

In PostgreSQL, boolean and categorical data types fall into two main groups:

  1. Boolean values — true / false logic
  2. Categorical values — limited sets of labels

These data types are used for:

  • flags
  • status indicators
  • binary decisions
  • categorical segmentation

Main Boolean & Categorical Data Types

Data Type   Description                 Example Value            Memory Size      Typical Use Case
BOOLEAN     Logical true / false value  TRUE, FALSE              1 byte           Active flags, eligibility
CHAR(n)     Fixed-length category code  CHAR(1) → 'Y'            n bytes (fixed)  Yes/No indicators
VARCHAR(n)  Short categorical label     'premium'                Variable (≤ n)   Customer segments
TEXT        Free-form category label    'high_value_customer'    Variable         Tags, labels
ENUM        Predefined set of values    ('low','medium','high')  4 bytes          Controlled categories
Caution

PostgreSQL accepts multiple boolean literals:

TRUE / FALSE, true / false, 1 / 0, yes / no.
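A quick illustration of these literals; each one casts to the same BOOLEAN values:

```sql
-- All of the following are accepted boolean inputs in PostgreSQL.
SELECT TRUE           AS a,  -- true
       'yes'::BOOLEAN AS b,  -- true
       1::BOOLEAN     AS c,  -- true
       'f'::BOOLEAN   AS d;  -- false
```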

Control vs Flexibility Trade-off

The main trade-off for categorical data is between data integrity and flexibility.

Requirement               Validation Strictness  Recommended Data Type
Binary logic              Very strict            BOOLEAN
Small fixed category set  Very strict            ENUM
Evolving categories       Medium                 VARCHAR(n)
User-defined labels       Low                    TEXT

Index Size Considerations

  • Index on BOOLEAN → very small → fast but often low selectivity
  • Index on short categorical strings → efficient
  • Index on high-cardinality categories → more useful than boolean indexes
  • Avoid indexing columns with very few distinct values

Important: Best Practice

Use BOOLEAN only for true binary logic.

  • Do not encode booleans as ‘Y’/‘N’ unless required
  • Use ENUM only when categories are stable
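A sketch of defining and using an ENUM type; the type and table names below are illustrative, not part of the course database:

```sql
-- Define a fixed category set, then use it like any other column type.
CREATE TYPE priority_level AS ENUM ('low', 'medium', 'high');

CREATE TABLE support_tickets (
  ticket_id INTEGER PRIMARY KEY,
  priority  priority_level NOT NULL
);

-- Accepted: value belongs to the declared set.
INSERT INTO support_tickets VALUES (1, 'high');
-- A value outside the set (e.g. 'urgent') would be rejected.
```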

Special & Advanced Data Types | OPTIONAL

Beyond standard numeric, text, and date types, PostgreSQL supports specialized data types designed for specific domains such as:

  • geospatial analysis
  • networks
  • graphs
  • semi-structured data

These data types are typically used in advanced analytics, telecom, log analysis, and platform-scale systems.

Special data types are used for:

  • geospatial analytics
  • network analysis
  • semi-structured data
  • system-level identifiers
  • advanced domain modeling

Main Special Data Types (with Example Values)

Data Type / Extension  Description                    Example Value                         Typical Use Case
JSON / JSONB           Semi-structured JSON data      {"plan":"premium","usage":120}        Logs, APIs, configs
ARRAY                  Array of values                {1,2,3}                               Tags, multi-valued attributes
UUID                   Universally unique identifier  550e8400-e29b-41d4-a716-446655440000  Distributed IDs
INET                   IP address                     192.168.1.1                           Network traffic
CIDR                   Network block                  192.168.0.0/24                        Subnet modeling
GEOMETRY (PostGIS)     Geometric objects              POINT(40.18 44.51)                    Maps, locations
GEOGRAPHY (PostGIS)    Earth-based coordinates        POINT(44.51 40.18)                    Distance calculations
ltree                  Hierarchical tree paths        region.city.store                     Organizational trees
pgRouting              Graph/network extension        N/A                                   Network routing, telecom
Note

Most advanced data types are provided via PostgreSQL extensions, not core SQL.
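As a small taste of querying semi-structured data, the ->> operator extracts a JSON field as text. The payload below mirrors the example value from the table above:

```sql
-- ->> returns text; cast when a numeric field is needed.
SELECT '{"plan":"premium","usage":120}'::JSONB ->> 'plan'              AS plan,   -- 'premium'
       ('{"plan":"premium","usage":120}'::JSONB ->> 'usage')::INTEGER AS usage;  -- 120
```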

Domain-Specific Trade-offs

Special data types trade generality for domain power.

Requirement               Domain               Recommended Type
Flexible event payloads   Logging / APIs       JSONB
Multi-valued attributes   Analytics            ARRAY
Globally unique IDs       Distributed systems  UUID
IP & network data         Telecom / IT         INET, CIDR
Location-based analytics  GIS                  PostGIS (GEOMETRY / GEOGRAPHY)
Graph traversal           Networks             pgRouting, graph models

Indexing & Performance Considerations

Special data types usually require specialized indexes:

  • JSONB → GIN index
  • GEOMETRY / GEOGRAPHY → GiST / SP-GiST
  • ARRAY → GIN
  • INET / CIDR → GiST

Improper indexing can make these types very slow on large datasets.

Warning: Performance

Advanced data types without proper indexes often lead to full scans, even on indexed tables.

PostGIS (Geospatial Extension)

PostGIS adds geospatial intelligence to PostgreSQL.

Common capabilities:

  • Distance calculations
  • Spatial joins
  • Radius searches
  • Area coverage
Important

As a data analyst you are not expected to design PostGIS schemas or routing graphs — but you will query them.

Data Dictionary

A data dictionary documents metadata such as:

  • table names
  • column descriptions
  • keys and constraints
  • relationships

As an analyst, the data dictionary helps you understand where data lives and how to query it correctly.

Overview

This data dictionary describes the structure and meaning of the tables used in the sales analytics database.

  • table purpose
  • column definitions
  • data types
  • keys and relationships

Table: employees

Description
Stores information about employees involved in the sales process.

Column Name  Data Type  Key  Description
employee_id  SERIAL     PK   Unique identifier for each employee
first_name   TEXT            Employee first name
last_name    TEXT            Employee last name
email        TEXT            Employee email address
salary       NUMERIC         Employee salary

Table: customers

Description
Stores customer profile and location information.

Column Name    Data Type  Key  Description
customer_id    INTEGER    PK   Unique identifier for each customer
customer_name  TEXT            Full customer name
address        TEXT            Customer address
city           TEXT            City of residence
zip_code       TEXT            Postal / ZIP code

Table: products

Description
Contains product catalog information.

Column Name   Data Type  Key  Description
product_id    INTEGER    PK   Unique product identifier
product_name  TEXT            Name of the product
price         NUMERIC         Unit price of the product
description   TEXT            Product description
category      TEXT            Product category

Table: orders

Description
Stores order-level information and time attributes.

Column Name  Data Type  Key  Description
order_id     INTEGER    PK   Unique order identifier
order_date   TIMESTAMP       Date and time when the order was placed
year         INTEGER         Order year (derived)
quarter      INTEGER         Order quarter (derived)
month        TEXT            Order month name (derived)

Table: sales

Description
Central fact table storing individual sales transaction lines.

Column Name     Data Type  Key  Description
transaction_id  INTEGER    PK   Unique transaction identifier (surrogate key)
order_id        INTEGER    FK   References orders(order_id)
product_id      INTEGER    FK   References products(product_id)
customer_id     INTEGER    FK   References customers(customer_id)
employee_id     INTEGER    FK   References employees(employee_id)
total_sales     NUMERIC         Total sales value for the transaction
quantity        INTEGER         Number of units sold
discount        NUMERIC         Discount applied to the transaction

Relationships Summary

From Table  Column       To Table   Column       Relationship Type
sales       order_id     orders     order_id     Many-to-one
sales       product_id   products   product_id   Many-to-one
sales       customer_id  customers  customer_id  Many-to-one
sales       employee_id  employees  employee_id  Many-to-one

Analyst Notes

  • sales is the fact table
  • Other tables act as dimensions
  • transaction_id is a surrogate key
  • Time attributes are intentionally denormalized
  • Foreign keys may be logically enforced in analytical systems
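Putting the relationships to work: a typical analytical query joins the sales fact table to its dimension tables through the foreign keys listed above. Column names follow the data dictionary; the aggregation itself is an illustrative example:

```sql
-- Revenue per product category and customer city.
SELECT
  p.category,
  c.city,
  SUM(s.total_sales) AS revenue
FROM sales AS s
JOIN products  AS p ON p.product_id  = s.product_id
JOIN customers AS c ON c.customer_id = s.customer_id
GROUP BY p.category, c.city
ORDER BY revenue DESC;
```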

DDL: Data Definition Language

DDL (Data Definition Language) is responsible for defining and managing the structure of a database.

DDL statements operate on database objects, not on individual rows.
They describe what the database looks like, not what data it contains.

DDL is foundational: every data operation you perform later depends on decisions made at the DDL level.

Common DDL Statements

DDL includes the following core commands:

  • CREATE
  • ALTER
  • DROP
  • TRUNCATE

Each of these affects database schema and metadata.

CREATE (DDL Context)

In the context of DDL, CREATE is used to define new database objects, such as tables and indexes.

Typical use cases include:

  • creating analytical tables
  • creating staging tables
  • creating indexes to improve performance

Example: create an index to support analytical queries.

CREATE INDEX idx_sales_product_id
ON sales (product_id);

This command does not modify data.
It changes how efficiently the database can access existing data.

ALTER

The ALTER statement modifies the structure of an existing database object.

Common use cases include:

  • adding columns
  • enforcing constraints
  • changing column properties
ALTER TABLE products
ADD CONSTRAINT chk_products_price
CHECK (price >= 0);

DROP

The DROP statement permanently removes a database object.

DROP INDEX idx_sales_product_id;
DROP TABLE products;

This operation is irreversible and should be used with extreme caution.

TRUNCATE

TRUNCATE removes all rows from a table while keeping its structure intact.

TRUNCATE TABLE sales_staging;

CRUD: Data Manipulation in Practice

CRUD describes how we work with data inside existing tables.
These are the SQL operations data analysts use most frequently.

CRUD stands for:

  • CREATE
  • READ
  • UPDATE
  • DELETE
Important: Critical Terminology Clarification

The word CREATE means different things in different contexts.

  • DDL CREATE → creates database objects (tables, indexes)
  • CRUD CREATE → inserts new data (INSERT)

Confusing these two is one of the most common beginner mistakes.

Note: DDL vs CRUD — Conceptual Comparison

  • DDL → defines the structure
  • CRUD → manipulates the data
  • DDL → happens rarely
  • CRUD → happens constantly

DDL sets the rules.
CRUD must follow them.

CREATE (CRUD → INSERT)

In the CRUD context, CREATE means adding new rows to an existing table.

This is done using the INSERT statement.

INSERT INTO products (product_id, product_name, price, category)
VALUES (101, 'Wireless Mouse', 24.99, 'Accessories');

You may omit columns that allow NULL or have default values.

INSERT INTO products (product_id, product_name, price)
VALUES (102, 'USB-C Cable', 9.99);

READ (SELECT)

The READ operation retrieves data and is implemented using SELECT.

SELECT
  product_id,
  product_name
FROM products;

Filtering rows is done using WHERE.

SELECT
  *
FROM sales
WHERE total_sales < 50;

UPDATE

The UPDATE statement modifies existing records.

Updates should always target specific rows.

UPDATE products
SET price = 49.99
WHERE product_id = 12;

DELETE

The DELETE statement removes records from a table.

Because deletions are irreversible, this command must be used with extreme caution.

DELETE FROM sales
WHERE transaction_id = 1004;
Important: Mental Model to Remember

  • DDL → defines tables, columns, constraints, indexes
  • CRUD → inserts, reads, updates, deletes rows
  • DML → the SQL statements (INSERT, SELECT, UPDATE, DELETE) implementing CRUD

The first part of DML will be introduced in Session 3.
You can revisit it here: Session 3 DML Basics

Note: To Quote or Not to Quote?

When writing SQL statements, make sure to place quotes around non-numeric values such as text and dates.

For example, in an INSERT command:

  • text values like title and description must be enclosed in quotes
  • numeric values like language_id and release_year should not

Different database systems handle quotes differently.

PostgreSQL accepts only single quotes ('') for string literals, while some other systems allow both single and double quotes.

Constraints

Constraints have already appeared in the examples earlier in this note; this section takes a deeper dive into each of them.

In this section, you’ll be looking at constraints, which play a crucial role in keeping your data organized.

Constraints specify what type of data a table or column can accept, and they are typically set when a table is created. When defined correctly, constraints:

  • enforce data integrity
  • prevent invalid or inconsistent data
  • act as built-in data quality checks

Below are the most common constraints you will encounter when designing relational databases.

UNIQUE Constraint

The UNIQUE constraint ensures that all values in a column are distinct.
It is commonly used to prevent duplicate records for attributes that must be unique across entities.

Typical use cases include:

  • email addresses
  • usernames
  • national identification numbers

Example: ensure each customer email is unique.

CREATE TABLE customers (
  customer_id INTEGER PRIMARY KEY,
  email TEXT UNIQUE,
  phone_number TEXT
);

With this constraint in place, PostgreSQL will reject any attempt to insert a duplicate email address.
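
For example, with the customers table above, a second insert reusing an existing email fails (the email address below is made up for illustration):

```sql
INSERT INTO customers (customer_id, email)
VALUES (1, 'ada@example.com');

-- this second insert is rejected: the email already exists
INSERT INTO customers (customer_id, email)
VALUES (2, 'ada@example.com');
-- ERROR: duplicate key value violates unique constraint
```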

NOT NULL Constraint

The NOT NULL constraint ensures that a column cannot contain empty (NULL) values.

Use this constraint for fields that are mandatory and must always be provided.

Typical use cases include:

  • primary identifiers
  • contact information
  • transaction timestamps

Example: ensure every customer has a phone number.

CREATE TABLE customers (
  customer_id INTEGER PRIMARY KEY,
  phone_number TEXT NOT NULL,
  email TEXT
);

If an insert is attempted without a phone number, PostgreSQL will return an error.
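
For example, with the customers table above, omitting the mandatory column triggers the error:

```sql
-- fails: phone_number is NOT NULL and no value is supplied
INSERT INTO customers (customer_id, email)
VALUES (1, 'ada@example.com');
-- ERROR: null value in column "phone_number" violates not-null constraint
```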

PRIMARY KEY Constraint

A PRIMARY KEY uniquely identifies each row in a table.

A primary key:

  • must be unique
  • cannot contain NULL values
  • can be defined only once per table (though it may span multiple columns)

Example: define a primary key for a products table.

CREATE TABLE products (
  product_id INTEGER PRIMARY KEY,
  product_name TEXT NOT NULL,
  price NUMERIC(10, 2)
);

The PRIMARY KEY constraint automatically enforces both UNIQUE and NOT NULL.

FOREIGN KEY Constraint

A FOREIGN KEY creates a relationship between two tables by referencing the primary key of another table.

It ensures referential integrity, meaning that referenced values must exist in the parent table.

Example: link sales records to customers.

CREATE TABLE sales (
  transaction_id INTEGER PRIMARY KEY,
  customer_id INTEGER,
  total_sales NUMERIC(10, 2),
  FOREIGN KEY (customer_id)
    REFERENCES customers (customer_id)
);

With this constraint, you cannot insert a sale for a customer that does not exist in the customers table.
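
For example, if no row in customers has customer_id 999, this insert into the sales table above is rejected:

```sql
-- fails: customer_id 999 does not exist in the customers table
INSERT INTO sales (transaction_id, customer_id, total_sales)
VALUES (3001, 999, 120.00);
-- ERROR: insert or update on table "sales" violates foreign key constraint
```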

CHECK Constraint

The CHECK constraint restricts the range or condition of values that can be inserted into a column.

It validates data based on logical expressions.

Typical use cases include:

  • value ranges
  • minimum or maximum limits
  • domain rules

Example: ensure total sales values are non-negative.

CREATE TABLE sales (
  transaction_id INTEGER PRIMARY KEY,
  total_sales NUMERIC(10, 2) CHECK (total_sales >= 0)
);

If a value violates the condition, the insert or update will fail with an error.
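
For example, a negative amount fails the condition defined on the sales table above:

```sql
-- fails the CHECK condition: total_sales must be >= 0
INSERT INTO sales (transaction_id, total_sales)
VALUES (2001, -15.00);
-- ERROR: new row for relation "sales" violates check constraint
```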

SQL Rules and Best Practices

Now that you’ve seen the most common SQL statements, let’s get to grips with some basic rules and best practices for using SQL.
This list is not exhaustive—it is a starting point. As you gain hands-on experience, you will naturally pick up additional techniques and conventions.

Numbers and Underscores

In a relational database:

  • Table names and column names cannot start with a number
  • They cannot contain spaces
  • Use underscores (_) to separate words

This improves compatibility and readability.

Recommended

CREATE TABLE customer_lifecycle (customer_id INTEGER);
CREATE TABLE customer (customer_id INTEGER);

Not Recommended

CREATE TABLE customer lifecycle (customer_id INTEGER);
CREATE TABLE 3customer (customer_id INTEGER);

Capitalization

A widely accepted convention is:

  • SQL keywords → UPPERCASE
  • Table names → lowercase
  • Column names → lowercase

This makes queries easier to scan and understand.

Recommended

SELECT product_id, product_name
FROM products;
SELECT transaction_id, total_sales
FROM sales;

Not Recommended

Select PRODUCT_ID, PRODUCT_NAME From PRODUCTS;
SELECT TRANSACTION_ID, TOTAL_SALES FROM SALES;

Commenting

It is good practice to comment your SQL code so that other analysts and developers can easily read and understand your queries.

Comments do not affect query execution—they exist purely for documentation and readability.

Use -- for single-line comments.

-- select and display the first 10 rows from the sales table
SELECT *
FROM sales
LIMIT 10;

Multi-line comments can also be used to temporarily disable blocks of SQL code during testing or debugging.

/*
SELECT
  s.transaction_id,
  p.product_name,
  s.total_sales
FROM sales s
JOIN products p
  ON s.product_id = p.product_id
WHERE s.total_sales > 100;
*/

Aliasing

Aliases assign temporary names to columns or tables within a query.

They are especially useful for:

  • improving readability
  • shortening long names
  • resolving ambiguity in joins

Column aliases rename columns in the query output.

SELECT
  product_name AS item_name,
  price AS unit_price
FROM products;

Table aliases are commonly used in joins.

SELECT
  s.transaction_id,
  p.product_name,
  s.total_sales
FROM sales AS s
JOIN products AS p
  ON s.product_id = p.product_id;
Tip

You will find many more tips here.