Data Analytics Bootcamp

Session 13: DA with SQL | Final Project

Project

Final SQL Project | Kickoff

Project Goal

Design and deliver a production-grade analytical SQL project that demonstrates:

  • Proper normalization
  • Relational and spatial modeling
  • Analytical querying
  • Documentation and reproducibility

This project simulates a real analytics workflow, from raw data to insights.


Step 1 | Select a Business Domain

Choose one primary business table that genuinely interests your team.

Examples:

  • Customer transactions
  • Product sales
  • Service usage events
  • Location-based activity

Mandatory requirement:

  • The domain must include geography
  • Geography can be:
    • Customer locations
    • Service areas
    • Store regions
    • Delivery zones

If unsure, propose your domain and validate it before proceeding.

Step 2 | Normalize the Data Model

Start from a single wide table (conceptual or real).

Example (not final):

  • customer_name
  • product_name
  • city
  • region
  • country
  • order_date
  • price

Normalize into logical entities:

  • Countries
  • Regions
  • Cities
  • Customers
  • Events / Orders / Activities
  • Products (if applicable)
  • Spatial boundaries

Rules:

  • Remove redundancy
  • Use surrogate or natural keys consistently
  • Enforce referential integrity
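These rules can be sketched in DDL for the geography part of the wide example; the table and column names here are illustrative, not the final schema for your project:

```sql
-- Illustrative sketch: the geography hierarchy split into one table per
-- entity, with surrogate keys and foreign keys enforcing integrity.
CREATE TABLE countries (
    country_id   SERIAL PRIMARY KEY,            -- surrogate key
    country_name VARCHAR(50) UNIQUE NOT NULL
);

CREATE TABLE regions (
    region_id   SERIAL PRIMARY KEY,
    region_name VARCHAR(50) NOT NULL,
    country_id  INT NOT NULL REFERENCES countries(country_id)
);

CREATE TABLE cities (
    city_id   SERIAL PRIMARY KEY,
    city_name VARCHAR(50) NOT NULL,
    region_id INT NOT NULL REFERENCES regions(region_id)
);
```

Each level stores its name exactly once, removing the redundancy of repeating region and country on every row.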

Step 3 | Create a New Database and Schema

Create a dedicated database for the project.

CREATE DATABASE final_sql_project; -- use any meaningful name that fits your project

Create a dedicated schema.

CREATE SCHEMA analytics;
  • All tables must live under this schema
  • Never use public

Step 4 | Create Tables (DDL)

Define all tables explicitly:

  • Primary keys
  • Foreign keys
  • Constraints
  • Geometry types with SRID
CREATE TABLE analytics.example_table (
  id INT PRIMARY KEY,
  geom GEOMETRY(Point, 4326)
);
  • Spatial tables are mandatory
  • Use PostGIS correctly

Step 5 | Populate the Database

Populate tables using CSV-based loading.

  • Use COPY or staging tables
  • Data volume should be realistic: more than 10,000 rows
COPY analytics.table_name
FROM '/data/table_name.csv'
CSV HEADER;
  • Validate row counts
  • Validate referential integrity
  • Validate geometries using ST_IsValid
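The three validation bullets translate into quick queries; table_name, child_table, parent_table, and parent_id below are placeholders for your own tables:

```sql
-- Row count sanity check after COPY
SELECT COUNT(*) FROM analytics.table_name;

-- Referential integrity: child rows with no matching parent (expect zero)
SELECT COUNT(*)
FROM analytics.child_table c
LEFT JOIN analytics.parent_table p ON p.parent_id = c.parent_id
WHERE p.parent_id IS NULL;

-- Invalid geometries, with the reason (expect zero rows)
SELECT id, ST_IsValidReason(geom)
FROM analytics.example_table
WHERE NOT ST_IsValid(geom);
```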

Step 6 | Create the ERD

Produce an Entity Relationship Diagram that shows:

  • Tables
  • Primary keys
  • Foreign keys
  • Cardinality

Step 7 | Push the Changes to GitHub

Create a dedicated GitHub repository.

Repository structure example:

  • /queries
  • /init
  • /data
  • /erd
  • /docs
  • README.md
  • docker-compose.yaml
  • .env

Push:

  • DDL scripts
  • CSV files (or generation scripts)
  • ERD
  • Documentation
Warning

Commit early and often.


Step 8 | Define Analytical Questions

Before writing complex SQL, define business questions.

Examples:

  • Revenue by country / region / city
  • Customer distribution by geography
  • Top products per region
  • Customers outside expected boundaries
  • Growth trends over time

Each question should map to specific joins.

Step 9 | Construct a Denormalized Analytical Table

Create a flattened analytical table.

  • Join all necessary dimensions
  • One row per analytical grain (e.g., order, customer-day)
CREATE TABLE analytics.fact_orders AS
SELECT
  o.order_id,
  o.order_date,
  c.customer_id,
  ci.city_name,
  r.region_name,
  co.country_name,
  SUM(oi.quantity * p.price) AS revenue
FROM analytics.orders o
JOIN analytics.customers c ON o.customer_id = c.customer_id
JOIN analytics.cities ci ON c.city_id = ci.city_id
JOIN analytics.regions r ON ci.region_id = r.region_id
JOIN analytics.countries co ON r.country_id = co.country_id
JOIN analytics.order_items oi ON o.order_id = oi.order_id
JOIN analytics.products p ON oi.product_id = p.product_id
GROUP BY
  o.order_id, o.order_date,
  c.customer_id, ci.city_name,
  r.region_name, co.country_name;
  • This table will be reused later for:
    • Subqueries
    • Window functions
    • Advanced analytics
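As a sketch of that reuse, a window function over the columns defined above can rank orders by revenue within each country:

```sql
-- Rank each order by revenue within its country
SELECT
    country_name,
    order_id,
    revenue,
    RANK() OVER (PARTITION BY country_name ORDER BY revenue DESC) AS revenue_rank
FROM analytics.fact_orders;
```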

Step 10 | Visualization

Use any tool; SQL client charts are recommended.

Focus on:

  • Geography-based insights
  • Trends
  • Comparisons

Step 11 | Create a Data Dictionary

Document every table and column.

Example:

  • Table name
  • Column name
  • Data type
  • Description
  • Constraints

Store as: Markdown (.md)

This is a professional requirement, not optional.
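One possible shortcut: generate the skeleton of the dictionary from the system catalog, then fill in the Description and Constraints columns by hand in Markdown:

```sql
-- Raw material for the data dictionary: every column in the analytics schema
SELECT
    table_name,
    column_name,
    data_type,
    is_nullable
FROM information_schema.columns
WHERE table_schema = 'analytics'
ORDER BY table_name, ordinal_position;
```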


Step 12 | Final Review and Presentation

Deliverables:

  • SQL database
  • ERD
  • GitHub repository
  • Analytical tables
  • Answered business questions
  • Visuals
  • Data dictionary

Final step:

  • Present:
    • Design decisions
    • Normalization logic
    • Join strategy
    • Analytical findings

Evaluation Focus

  • Correct normalization
  • Correct joins
  • Proper use of geography
  • Clean SQL
  • Reproducibility
  • Analytical thinking

This project simulates real-world analytics work.

Example Project

This analysis provides data-driven insights into revenue, customer distribution, and sales trends to guide the worldwide online service and optimize its market strategy.

What Do We Have?

We have:

  • Fully denormalized table
  • Actor-level exploded rows (massive duplication)
  • No surrogate IDs
  • No foreign keys
  • No artificial primary keys

Goal

We have to answer the following questions:

  • What is the general statistical summary of movie rental conditions?
  • What is the geographical distribution of Rockbuster’s customers?
  • Which countries do the top 10 customers come from?
  • Which genres are the best sellers?
  • Where are the customers with the highest lifetime value based?

Creating Database

Go to pgAdmin and open the Query Tool

CREATE DATABASE rockbuster;

Create Schema

Now open another Query Tool after clicking on the rockbuster database

Once you’re in the rockbuster query tool, start creating the schema:

CREATE SCHEMA analytics;

Creating the Table analytics._stg_rockbuster

DROP TABLE IF EXISTS analytics._stg_rockbuster;

CREATE TABLE analytics._stg_rockbuster (
    rental_date           timestamp,
    return_date           timestamp,
    amount                numeric(5,2),
    payment_date          timestamp,

    customer_first_name   varchar(45),
    customer_last_name    varchar(45),
    customer_email        varchar(50),
    customer_active       boolean,
    customer_create_date  date,

    customer_address      varchar(50),
    customer_address2     varchar(50),
    customer_district     varchar(20),
    customer_postal_code  varchar(10),
    customer_phone        varchar(20),

    customer_city         varchar(50),
    customer_country      varchar(50),

    title                 varchar(255),
    description           text,
    release_year          int,
    rental_duration       int,
    rental_rate           numeric(4,2),
    length                int,
    replacement_cost      numeric(5,2),
    rating                varchar(10),
    language_name         varchar(20),

    category_name         varchar(25),

    actor_first_name      varchar(45),
    actor_last_name       varchar(45),

    store_number          int,

    staff_first_name      varchar(45),
    staff_last_name       varchar(45),
    staff_email           varchar(50)
);

Download CSV file

You can download the data from here.

Place it under:

data/rockbuster/rockbuster_denormalized.csv

Inserting the Data Into analytics._stg_rockbuster

Once you have downloaded the CSV data, let’s ingest it into SQL so we can continue with the data modeling part.

COPY analytics._stg_rockbuster
FROM '/docker-entrypoint-initdb.d/data/rockbuster/rockbuster_denormalized.csv'
CSV HEADER
NULL 'NULL';

Data Modeling

Once we have the flat file in SQL, we begin the logical normalization process.

Q0: What do we have?

We have movie rental transaction data including:

  • rental information
  • payment information
  • customer demographic data
  • geographic data
  • film metadata
  • actor-level exploded rows

The list of columns:

rental_date,return_date,amount,payment_date,customer_first_name,customer_last_name,customer_email,customer_active,customer_create_date,customer_address,customer_address2,customer_district,customer_postal_code,customer_phone,customer_city,customer_country,title,description,release_year,rental_duration,rental_rate,length,replacement_cost,rating,language_name,category_name,actor_first_name,actor_last_name,store_number,staff_first_name,staff_last_name,staff_email

Q1: What does one row represent?

One row represents:

  • One rental transaction
  • One payment
  • One film
  • One actor associated with that film

Because actors are stored directly in the row, the dataset is duplicated at the actor level.
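The duplication can be confirmed directly: grouping by the columns that identify a rental shows multiple rows per rental, one per actor:

```sql
-- Rows per rental in the flat file (values above 1 come from the actor explosion)
SELECT
    rental_date,
    customer_email,
    title,
    COUNT(*) AS rows_per_rental
FROM analytics._stg_rockbuster
GROUP BY rental_date, customer_email, title
ORDER BY rows_per_rental DESC
LIMIT 10;
```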

Q2: Is this table normalized?

No!

  • Customer data repeats
  • Film attributes repeat
  • Country repeats for each city
  • Actor rows duplicate film data

This is a classic denormalized transactional dataset.

Q3: Does customer_email uniquely identify a customer?

Yes, most likely:

customer_email →
customer_first_name,
customer_last_name,
customer_active,
customer_create_date,
customer_address

Therefore, customer attributes must be separated.

Q4: Does customer_city determine customer_country?

Yes: This is a transitive dependency.

city → country

Country should not be stored in the same table as rental.
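The dependency can also be verified empirically: if this query returns no rows, no city in the data maps to more than one country:

```sql
-- Cities violating the city -> country dependency (expect zero rows)
SELECT customer_city, COUNT(DISTINCT customer_country) AS n_countries
FROM analytics._stg_rockbuster
GROUP BY customer_city
HAVING COUNT(DISTINCT customer_country) > 1;
```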

Q5: Does title determine film attributes?

Yes: Film should be its own entity.

title →
description,
release_year,
rental_duration,
rental_rate,
length,
replacement_cost,
rating,
language_name

Q6: Is there a many-to-many relationship between films and actors?

Yes:

  • One film has multiple actors
  • One actor appears in multiple films

Therefore, a bridge table is required.

Exploring the Raw Data

Before designing tables, we inspect duplication.

Total Rows

SELECT 
  COUNT(*) 
FROM analytics._stg_rockbuster;

Output: 80,115

Distinct Customers

SELECT 
  COUNT(DISTINCT customer_email)
FROM analytics._stg_rockbuster;

Output: 599

Distinct Films

SELECT 
  COUNT(DISTINCT title)
FROM analytics._stg_rockbuster;

Output: 955

Distinct Actors

SELECT 
  COUNT(DISTINCT actor_first_name || ' ' || actor_last_name) -- separator avoids counting name collisions as one actor
FROM analytics._stg_rockbuster;

Output: 199

Q7: What does the difference between total rows and distinct films indicate?

It indicates:

  • Film data is repeated multiple times
  • Actor explosion causes duplication
  • The table mixes multiple entity types

STEP 3: Logical Entity Derivation

  • Country
  • City
  • Address
  • Customer
  • Language
  • Film
  • Category
  • Actor
  • Store
  • Staff
  • Rental
  • Payment

Q8: Why do we not immediately create tables?

Because normalization starts with reasoning about dependencies, not writing SQL.

Important

Database design is a logical process first, technical process second.

Deriving the Normalized Schema

Now that we have logically identified the entities, we begin implementing the normalization process in SQL.

We will move from stable dimensions to transactional tables.

Geography Hierarchy

Q9: Why start with geography?

A9: Starting with the geography hierarchy avoids transitive dependencies inside customer:

  • city \(\rightarrow\) country
  • address \(\rightarrow\) city

Add the PostGIS extension

DROP EXTENSION IF EXISTS postgis CASCADE;
CREATE EXTENSION postgis;
SELECT PostGIS_Version();

This should return something like: 3.4 USE_GEOS=1 USE_PROJ=1 USE_STATS=1

Country

Create country Table
DROP TABLE IF EXISTS analytics.country CASCADE;

CREATE TABLE analytics.country (
    country_id SERIAL PRIMARY KEY, -- auto-increment
    country_name VARCHAR(50) UNIQUE NOT NULL,
    geom geometry(MULTIPOLYGON, 4326) -- country boundary, filled from the staging table below
);
Insert Into country Table

Note: create and populate analytics._stg_world_countries (next section) before running this insert.

INSERT INTO analytics.country (country_name, geom)
SELECT DISTINCT
    a.customer_country,
    b.geom
FROM analytics._stg_rockbuster a
LEFT JOIN analytics._stg_world_countries b
  ON b.country_name = a.customer_country
WHERE a.customer_country IS NOT NULL;

Country Boundaries

Before running this, make sure to have in your data folder the country geojson, which you can download from here.

Create _stg_world_countries Table
CREATE TABLE analytics._stg_world_countries (
    country_name TEXT NOT NULL,
    country_code TEXT NOT NULL,
    geom geometry(MULTIPOLYGON, 4326) NOT NULL
);
Insert Into _stg_world_countries Table
INSERT INTO analytics._stg_world_countries (country_name, country_code, geom)
SELECT
    feature->'properties'->>'name' AS country_name,
    feature->>'id' AS country_code,
    ST_SetSRID(
        ST_Multi(
            ST_CollectionExtract(
                ST_Force2D(
                    ST_MakeValid(
                        ST_GeomFromGeoJSON(feature->>'geometry')
                    )
                ),
                3
            )
        ),
        4326
    ) AS geom
FROM (
    SELECT jsonb_array_elements(data->'features') AS feature
    FROM (
        SELECT pg_read_file('/docker-entrypoint-initdb.d/data/rockbuster/countries.geo.json')::jsonb AS data
    ) f
) sub;
View _stg_world_countries Table
SELECT 
    * 
FROM analytics._stg_world_countries 
LIMIT 10

City

Create city Table
DROP TABLE IF EXISTS analytics.city CASCADE;

CREATE TABLE analytics.city (
    city_id SERIAL PRIMARY KEY,
    city_name VARCHAR(50) NOT NULL,
    country_id INT REFERENCES analytics.country(country_id),
    UNIQUE (city_name, country_id)
);
Insert Into city Table
INSERT INTO analytics.city (city_name, country_id)
SELECT DISTINCT
    s.customer_city,
    c.country_id
FROM analytics._stg_rockbuster s
JOIN analytics.country c
  ON c.country_name = s.customer_country;
View city Table
SELECT 
    * 
FROM analytics.city 
LIMIT 10

Address

Create address Table
DROP TABLE IF EXISTS analytics.address CASCADE;

CREATE TABLE analytics.address (
    address_id SERIAL PRIMARY KEY,
    address VARCHAR(50),
    address2 VARCHAR(50),
    district VARCHAR(20),
    postal_code VARCHAR(10),
    phone VARCHAR(20),
    city_id INT REFERENCES analytics.city(city_id),
    UNIQUE(address, postal_code, city_id)
);
Insert Into address Table

INSERT INTO analytics.address (
    address,
    address2,
    district,
    postal_code,
    phone,
    city_id
)
SELECT DISTINCT
    s.customer_address,
    s.customer_address2,
    s.customer_district,
    s.customer_postal_code,
    s.customer_phone,
    c.city_id
FROM analytics._stg_rockbuster s
JOIN analytics.city c
  ON c.city_name = s.customer_city;
View address Table
SELECT 
    * 
FROM analytics.address 
LIMIT 10

Customer Entity

Q10: What uniquely identifies a customer?

A10: customer_email (natural key assumption for this exercise).

Create customer Table
DROP TABLE IF EXISTS analytics.customer CASCADE;

CREATE TABLE analytics.customer (
    customer_id SERIAL PRIMARY KEY,
    first_name VARCHAR(45),
    last_name VARCHAR(45),
    email VARCHAR(50) UNIQUE,
    active BOOLEAN,
    create_date DATE,
    address_id INT REFERENCES analytics.address(address_id)
);
Insert Into customer Table
INSERT INTO analytics.customer (
    first_name,
    last_name,
    email,
    active,
    create_date,
    address_id
)
SELECT DISTINCT
    s.customer_first_name,
    s.customer_last_name,
    s.customer_email,
    s.customer_active,
    s.customer_create_date,
    a.address_id
FROM analytics._stg_rockbuster s
JOIN analytics.address a
  ON a.address = s.customer_address
 AND a.postal_code = s.customer_postal_code;
View customer Table
SELECT 
    * 
FROM analytics.customer 
LIMIT 10

Language

Q11: What determines film attributes?

A11: title determines all film-related descriptive attributes.

Create language Table
DROP TABLE IF EXISTS analytics.language CASCADE;

CREATE TABLE analytics.language (
    language_id SERIAL PRIMARY KEY,
    language_name VARCHAR(20) UNIQUE
);
Insert Into language Table
INSERT INTO analytics.language (language_name)
SELECT DISTINCT language_name
FROM analytics._stg_rockbuster;
View language Table
SELECT 
    * 
FROM analytics.language 
LIMIT 10

Film

Create film Table
DROP TABLE IF EXISTS analytics.film CASCADE;

CREATE TABLE analytics.film (
    film_id SERIAL PRIMARY KEY,
    title VARCHAR(255),
    description TEXT,
    release_year INT,
    rental_duration INT,
    rental_rate NUMERIC(4,2),
    length INT,
    replacement_cost NUMERIC(5,2),
    rating VARCHAR(10),
    language_id INT REFERENCES analytics.language(language_id),
    UNIQUE(title, release_year)
);
Insert Into film Table
INSERT INTO analytics.film (
    title,
    description,
    release_year,
    rental_duration,
    rental_rate,
    length,
    replacement_cost,
    rating,
    language_id
)
SELECT DISTINCT
    s.title,
    s.description,
    s.release_year,
    s.rental_duration,
    s.rental_rate,
    s.length,
    s.replacement_cost,
    s.rating,
    l.language_id
FROM analytics._stg_rockbuster s
JOIN analytics.language l
  ON l.language_name = s.language_name;
View film Table
SELECT 
    * 
FROM analytics.film
LIMIT 10

Category

Create category Table
DROP TABLE IF EXISTS analytics.category CASCADE;

CREATE TABLE analytics.category (
    category_id SERIAL PRIMARY KEY,
    category_name VARCHAR(25) UNIQUE
);
Insert Into category Table
INSERT INTO analytics.category (category_name)
SELECT DISTINCT category_name
FROM analytics._stg_rockbuster;
View category Table
SELECT 
    * 
FROM analytics.category
LIMIT 10

Actor

Create actor Table
DROP TABLE IF EXISTS analytics.actor CASCADE;

CREATE TABLE analytics.actor (
    actor_id SERIAL PRIMARY KEY,
    first_name VARCHAR(45),
    last_name VARCHAR(45),
    UNIQUE(first_name, last_name)
);
Insert Into actor Table
INSERT INTO analytics.actor (first_name, last_name)
SELECT DISTINCT
    actor_first_name,
    actor_last_name
FROM analytics._stg_rockbuster;
View actor Table
SELECT 
    * 
FROM analytics.actor
LIMIT 10

Film Category Bridge

Create film_category Table

A bridge table resolves the many-to-many relationship between films and categories.

DROP TABLE IF EXISTS analytics.film_category;

CREATE TABLE analytics.film_category (
    film_id INT REFERENCES analytics.film(film_id),
    category_id INT REFERENCES analytics.category(category_id),
    PRIMARY KEY (film_id, category_id)
);
Insert Into film_category Table
INSERT INTO analytics.film_category
SELECT DISTINCT
    f.film_id,
    c.category_id
FROM analytics._stg_rockbuster s
JOIN analytics.film f
  ON f.title = s.title
JOIN analytics.category c
  ON c.category_name = s.category_name;
View film_category Table
SELECT 
    * 
FROM analytics.film_category
LIMIT 10

Film Actor Bridge

Create film_actor Table
DROP TABLE IF EXISTS analytics.film_actor;

CREATE TABLE analytics.film_actor (
    film_id INT REFERENCES analytics.film(film_id),
    actor_id INT REFERENCES analytics.actor(actor_id),
    PRIMARY KEY (film_id, actor_id)
);
Insert Into film_actor Table
INSERT INTO analytics.film_actor
SELECT DISTINCT
    f.film_id,
    a.actor_id
FROM analytics._stg_rockbuster s
JOIN analytics.film f
  ON f.title = s.title
JOIN analytics.actor a
  ON a.first_name = s.actor_first_name
 AND a.last_name  = s.actor_last_name;
View film_actor Table
SELECT 
    * 
FROM analytics.film_actor
LIMIT 10

Store

Create store Table
DROP TABLE IF EXISTS analytics.store CASCADE;

CREATE TABLE analytics.store (
    store_id SERIAL PRIMARY KEY,
    store_number INT UNIQUE
);
Insert Into store Table
INSERT INTO analytics.store (store_number)
SELECT DISTINCT store_number
FROM analytics._stg_rockbuster;
View store Table
SELECT 
    * 
FROM analytics.store
LIMIT 10

Staff

Create staff Table
DROP TABLE IF EXISTS analytics.staff CASCADE;

CREATE TABLE analytics.staff (
    staff_id SERIAL PRIMARY KEY,
    first_name VARCHAR(45),
    last_name VARCHAR(45),
    email VARCHAR(50),
    store_id INT REFERENCES analytics.store(store_id)
);
Insert Into staff Table
INSERT INTO analytics.staff (
    first_name,
    last_name,
    email,
    store_id
)
SELECT DISTINCT
    s.staff_first_name,
    s.staff_last_name,
    s.staff_email,
    st.store_id
FROM analytics._stg_rockbuster s
JOIN analytics.store st
  ON st.store_number = s.store_number;
View staff Table
SELECT 
    * 
FROM analytics.staff
LIMIT 10

Rental

Create rental Table
DROP TABLE IF EXISTS analytics.rental CASCADE;

CREATE TABLE analytics.rental (
    rental_id SERIAL PRIMARY KEY,
    rental_date TIMESTAMP,
    return_date TIMESTAMP,
    customer_id INT REFERENCES analytics.customer(customer_id),
    film_id INT REFERENCES analytics.film(film_id),
    staff_id INT REFERENCES analytics.staff(staff_id)
);
Insert Into rental Table

INSERT INTO analytics.rental (
    rental_date,
    return_date,
    customer_id,
    film_id,
    staff_id
)
SELECT DISTINCT
    s.rental_date,
    s.return_date,
    c.customer_id,
    f.film_id,
    st.staff_id
FROM analytics._stg_rockbuster s
JOIN analytics.customer c ON c.email = s.customer_email
JOIN analytics.film f ON f.title = s.title
JOIN analytics.staff st ON st.email = s.staff_email;
View rental Table
SELECT 
    * 
FROM analytics.rental
LIMIT 10

Payment

Create payment Table
DROP TABLE IF EXISTS analytics.payment CASCADE;

CREATE TABLE analytics.payment (
    payment_id SERIAL PRIMARY KEY,
    rental_id INT REFERENCES analytics.rental(rental_id),
    amount NUMERIC(5,2),
    payment_date TIMESTAMP
);
Insert Into payment Table
INSERT INTO analytics.payment (
    rental_id,
    amount,
    payment_date
)
SELECT DISTINCT
    r.rental_id,
    s.amount,
    s.payment_date
FROM analytics._stg_rockbuster s
JOIN analytics.customer c
  ON c.email = s.customer_email
JOIN analytics.rental r
  ON r.rental_date = s.rental_date
 AND r.customer_id = c.customer_id;
View payment Table
SELECT 
    * 
FROM analytics.payment
LIMIT 10

Final Checks

SELECT COUNT(*) FROM analytics.customer;
SELECT COUNT(*) FROM analytics.film;
SELECT COUNT(*) FROM analytics.actor;
SELECT COUNT(*) FROM analytics.rental;
SELECT COUNT(*) FROM analytics.payment;
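Beyond raw counts, it is worth confirming that the inner joins in the rental insert did not silently drop staged rows; the distinct rental grain in staging should match the rental row count:

```sql
-- Distinct rental grain in staging; compare with SELECT COUNT(*) FROM analytics.rental
SELECT COUNT(*) AS distinct_staged_rentals
FROM (
    SELECT DISTINCT rental_date, customer_email, title
    FROM analytics._stg_rockbuster
) s;
```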

Reflection

Q12: What normalization level did we reach?

A12:

  • 1NF \(\rightarrow\) Atomic attributes
  • 2NF \(\rightarrow\) No partial dependencies
  • 3NF \(\rightarrow\) No transitive dependencies

We have successfully derived a normalized relational schema directly from a flat CSV dataset.

ERD

[ERD diagram]

What is the general statistical summary of movie rental conditions?

SELECT
    COUNT(DISTINCT r.rental_id) AS total_rentals,
    SUM(p.amount) AS total_revenue,
    AVG(p.amount) AS avg_payment,
    MIN(p.amount) AS min_payment,
    MAX(p.amount) AS max_payment,
    AVG(EXTRACT(EPOCH FROM (r.return_date - r.rental_date))/86400)  AS avg_rental_days
FROM analytics.rental r
JOIN analytics.payment p ON p.rental_id = r.rental_id;

\[\downarrow\]

Metric                     Value
Total Rentals              29,108
Total Revenue              139,312.68
Average Payment            4.28
Minimum Payment            0.00
Maximum Payment            11.99
Average Rental Duration    5.03 days

Interpretation:

  • The core pricing likely centers around the 3.99–4.99 range.
  • 11.99 suggests penalties or late fees.
  • 0.00 likely represents free rentals, promotions, or data anomalies.

Average rental duration: 5.03 days

This suggests:

  • Most customers keep movies for approximately 5 days.
  • This is slightly above typical 3–4 day rental policies, possibly indicating late returns.
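One way to test this pricing interpretation is to bucket the payment amounts; the width_bucket query below is a sketch added here, not part of the original analysis:

```sql
-- Distribution of payment amounts in $1-wide bins
SELECT
    width_bucket(amount, 0, 12, 12) AS bin,
    MIN(amount) AS from_amount,
    MAX(amount) AS to_amount,
    COUNT(*)   AS payments
FROM analytics.payment
GROUP BY bin
ORDER BY bin;
```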

How many customers are there per country?

SELECT
    co.country_name,
    COUNT(c.customer_id) AS total_customers
FROM analytics.customer c
JOIN analytics.address a   ON a.address_id = c.address_id
JOIN analytics.city ci     ON ci.city_id = a.city_id
JOIN analytics.country co  ON co.country_id = ci.country_id
GROUP BY co.country_name
ORDER BY total_customers DESC;

How much revenue does each country generate?

SELECT
    co.country_name,
    SUM(p.amount) AS total_revenue,
    COUNT(DISTINCT c.customer_id) AS total_customers
FROM analytics.payment p
JOIN analytics.rental r     ON r.rental_id = p.rental_id
JOIN analytics.customer c   ON c.customer_id = r.customer_id
JOIN analytics.address a    ON a.address_id = c.address_id
JOIN analytics.city ci      ON ci.city_id = a.city_id
JOIN analytics.country co   ON co.country_id = ci.country_id
GROUP BY co.country_name
ORDER BY total_revenue DESC;

What is the average revenue per customer by country?

SELECT
    co.country_name,
    SUM(p.amount) / COUNT(DISTINCT c.customer_id) AS avg_revenue_per_customer
FROM analytics.payment p
JOIN analytics.rental r     ON r.rental_id = p.rental_id
JOIN analytics.customer c   ON c.customer_id = r.customer_id
JOIN analytics.address a    ON a.address_id = c.address_id
JOIN analytics.city ci      ON ci.city_id = a.city_id
JOIN analytics.country co   ON co.country_id = ci.country_id
GROUP BY co.country_name
ORDER BY avg_revenue_per_customer DESC;
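Finally, because analytics.country can carry boundary geometries populated from _stg_world_countries, the same revenue aggregation can be joined to geom for a map layer; this sketch assumes the geom column exists on country:

```sql
-- Revenue per country together with its boundary, ready for map visualization
-- (assumes analytics.country has a geom column loaded from _stg_world_countries)
SELECT
    co.country_name,
    co.geom,
    SUM(p.amount) AS total_revenue
FROM analytics.payment p
JOIN analytics.rental r    ON r.rental_id = p.rental_id
JOIN analytics.customer c  ON c.customer_id = r.customer_id
JOIN analytics.address a   ON a.address_id = c.address_id
JOIN analytics.city ci     ON ci.city_id = a.city_id
JOIN analytics.country co  ON co.country_id = ci.country_id
GROUP BY co.country_name, co.geom
ORDER BY total_revenue DESC;
```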