Data Analytics Bootcamp
  • Syllabus
  • Statistical Thinking
  • SQL
  • Python
  • Tableau
  • Lab
  • Capstone
  1. SQL
  2. SQL
  3. Session 01: Intro to Relational Databases
  • Syllabus
  • Statistical Thinking
    • Statistics
      • Statistics Session 01: Data Layers and Bias in Data
      • Statistics Session 02: Data Types
      • Statistics Session 03: Probabilistic Distributions
      • Statistics Session 04: Probabilistic Distributions
      • Statistics Session 05: Sampling
      • Statistics Session 06: Inferential Statistics
      • Slides
        • Course Intro
        • Descriptive Stats
        • Data Types
        • Continuous Distributions
        • Discrete Distributions
        • Sampling
        • Hypothesis Testing
  • SQL
    • SQL
      • Session 01: Intro to Relational Databases
      • Session 02: Intro to PostgreSQL
      • Session 03: DA with SQL | Data Types & Constraints
      • Session 04: DA with SQL | Filtering
      • Session 05: DA with SQL | Numeric Functions
      • Session 06: DA with SQL | String Functions
      • Session 07: DA with SQL | Date Functions
      • Session 08: DA with SQL | JOINs
      • Session 09: DA with SQL | Advanced SQL
      • Session 10: DA with SQL | Advanced SQL Functions
      • Session 11: DA with SQL | UDFs, Stored Procedures
      • Session 12: DA with SQL | Advanced Aggregations
      • Session 13: DA with SQL | Final Project
      • Slides
        • Intro to Relational Databases
        • Intro to PostgreSQL
        • Basic Queries: DDL DLM
        • Filtering
        • Numeric Functions
        • String Functions
        • Date Functions
        • Normalization and JOINs
        • Temporary Tables
        • Advanced SQL Functions
        • Reporting and Analysis with SQL
        • Advanced Aggregations
  • Python
    • Python
      • Session 01: Programming for Data Analysts
      • Session 02: Python basic Syntax, Data Structures
      • Session 03: Introduction to Pandas
      • Session 04: Advanced Pandas
      • Session 05: Intro to Data Visualization
      • Session 06: Data Visualization
      • Session 07: Working with Dates
      • Session 08: Data Visualization | Plotly
      • Session 09: Customer Segmentation | RFM
      • Slides
        • Data Analyst
  • Tableau
    • Tableau
      • Tableau Session 01: Introduction to Tableau
      • Tableau Session 02: Intermediate Visual Analytics
      • Tableau Session 03: Advanced Analytics
      • Tableau Session 04: Dashboard Design & Performance
      • Slides
        • Data Analyst
        • Data Analyst
        • Data Analyst
        • Data Analyst

On this page

  • Learning Goals
  • Introduction
    • What is a Database
    • Spreadsheets vs. Databases
    • Spreadsheet vs. Database Workflow
    • Spreadsheets vs. Databases Summary
  • Types of Data
    • Structured Data
    • Unstructured Data
    • Semi-Structured Data
  • Types of Databases
    • Relational Databases
    • NoSQL Databases
    • OLAP Systems (Online Analytical Processing)
    • OLTP Systems (Online Transaction Processing)
  • Type of Storing Databases
    • Centralized Databases
    • Distributed Databases
  • Data Management Architectures
    • Data Warehouse (DW)
    • Data Mart (DM)
    • Data Lake (DL)
    • Data Lakehouse (DLH)
  • Intro to SQL
  • Building our first Database
    • What are we going to achieve?
    • Step 1: Create New GitHub Repository
    • Step 2: Clone the Repository
    • Step 3: Download the data
    • Step 4: Update .gitignore
    • Step 5: Creating Init folder and SQL files
    • Step 6: Create .env file
    • Step 7: Create the docker-compose.yaml File
    • Step 8: Run the Docker Environment
    • Step 9: View the Tables
    • Step 10: Our First Queries | Preview
    • Step 11: Close and Delete Containers
    • Step 12: Every time you re-run
    • STEP 13: Detached Mode
  1. SQL
  2. SQL
  3. Session 01: Intro to Relational Databases

Session 01: Intro to Relational Databases

SQL
Query Plan
Queries
ORDER BY
GROUP BY
DISTINCT

Learning Goals

  • Explain the difference between relational databases and non-relational databases
  • Explain the importance of online analytical processing databases (OLAP) and relational database management systems (RDBMS)
  • Data Warehouse vs Data Mart vs Data Lake vs Data Lakehouse
  • Set up a database environment using docker

Introduction

Now, you have reached the stage in the Program where you need to learn how to manage your data in an institutional way.

To do so, you will need to know structured query language (SQL), the first language of:

  • Data analysts
  • Analytics Engineers
  • Data engineers
  • Data scientists.
TipAbout the importance of SQL

You cannott call yourself a data analyst until you know the basics of SQL


SQL is the language of Tables!

Besides SQL, you will explore various types of databases and what makes them different from Excel. During the rest of the session, you will get in-depth look at current, everyday practices in the field.

Since the best way to learn SQL is to use it, you will get hands-on tasks at the end of every Exercise.

What is a Database

A database is a collection of stored data, usually organized in tables of rows and columns and managed by a database management system (DBMS). Analysts collectively refer to the data, the DBMS, and other related tools as the database.

While there are several ways to store data in a database, the most common approach is to store the data in columns and rows. The intersection of each column and row is called a cell, and each cell represents a data element.

A Data element could be:

  • numeric:
  • text
  • you name it…

Table: Example of rows, columns, and a highlighted cell

Col A Col B Col C
Row 1 A1 B1 C1
Row 2 A2 B2 (Cell) C2
Row 3 A3 B3 C3

Cell = intersection of Row 2 and Column B → B2.

Spreadsheets vs. Databases

Databases differ from spreadsheets (excel-like tools) in several important ways, particularly in scalability, performance, and the ability to support multiple users. The key distinctions include:

  • User Access
    • Spreadsheets are typically designed for single-user access.
    • Databases allow multiple users to access and manipulate the data simultaneously.
  • Data Capacity
    • Spreadsheets handle relatively small datasets (usually up to a few hundred thousand rows).
    • Databases can efficiently store and process millions or even billions of records.
  • Performance
    • Spreadsheets load the entire dataset into memory when opened.
    • Databases retrieve only the necessary records, improving speed and scalability.
  • Query Capability
    • Spreadsheets require manual filtering, sorting, and counting.
    • Databases allow precise and fast queries using SQL.

Spreadsheet vs. Database Workflow

Let’s say you have a dataset of actors, and you want to count how many actors have the first name “Ed”.

actor_id first_name last_name last_update
1 Penelope Guiness 2013-05-26 14:47:57.62
2 Nick Wahlberg 2013-05-26 14:47:57.62
3 Ed Chase 2013-05-26 14:47:57.62
4 Jennifer Davis 2013-05-26 14:47:57.62
5 Johnny Lollobrigida 2013-05-26 14:47:57.62
6 Bette Nicholson 2013-05-26 14:47:57.62
7 Grace Mostel 2013-05-26 14:47:57.62
8 Matthew Johansson 2013-05-26 14:47:57.62
9 Joe Swank 2013-05-26 14:47:57.62
10 Christian Gable 2013-05-26 14:47:57.62

When performing tasks such as counting how many actors have the first name “Ed”, the difference between spreadsheets and databases becomes clear:

  • Using a Spreadsheet:
    • You must manually filter the first_name column to show only rows containing “Ed.”
    • After filtering, you manually count the matching rows or read the count displayed by the spreadsheet application.
    • With large datasets, loading, filtering, and recalculating can be slow and resource-intensive, causing delays and potential crashes.
    • Spreadsheets can process only one query at a time, limiting efficiency.
  • Using a Database
    • Instead of loading the entire dataset, you can execute a single SQL query such as:

      SELECT COUNT(*) 
      FROM actor
      WHERE first_name = 'Ed';
    • The result is returned almost instantly, even with very large datasets.

    • Databases support multiple users running queries simultaneously without performance degradation.

    • They are optimized for complex, concurrent data manipulation and retrieval, making them more suitable for scalable and shared environments.

Note

Overall, spreadsheets may be adequate for small, single-user datasets, but databases provide far superior performance, scalability, and efficiency when working with larger datasets or multiple users.

Spreadsheets vs. Databases Summary

SPREADSHEET DATABASE
Designed for single-user access Designed for multiple-user access
Only one user can manipulate the data at one point in time Multiple users can manipulate the data at one point in time
Handles a limited amount of data Handles small to massive amounts of data
Basic to moderate data operations Basic to complex data operations
Slow or unable to manipulate, extract, transform, and aggregate large amounts of data Faster to manipulate, extract, transform, and aggregate large amounts of data

Types of Data

Before we learn about different types of databases, we should understand the distinction between them. Broadly speaking, these below terms refer to how data is organized:

  • structured
  • unstructured
  • semi-structured data.
ID Name City 1 Maria Oslo 2 Nick Berlin
photos, chats, mp3…
<user id=“1001”> “name”: “Maria” “roles”: [“admin”]

Structured Data

Each row in the table represents a distinct entity (in this case, an actor), while each column corresponds to a specific attribute such as the actor’s first name. The intersection of a row and a column forms a cell, and each cell contains a single data value.

Importantly, every column is assigned a predefined data type, which determines the kind of information it can store. For example, the first_name column uses a text data type, meaning all values in this column must consist of alphabetical characters only.

ID Name City 1 Maria Oslo 2 Nick Berlin

Sample Data

actor_id first_name last_name last_update
1 Penelope Guiness 2013-05-26 14:47:57.62
2 Nick Wahlberg 2013-05-26 14:47:57.62
3 Ed Chase 2013-05-26 14:47:57.62
4 Jennifer Davis 2013-05-26 14:47:57.62
5 Johnny Lollobrigida 2013-05-26 14:47:57.62
6 Bette Nicholson 2013-05-26 14:47:57.62
7 Grace Mostel 2013-05-26 14:47:57.62
8 Matthew Johansson 2013-05-26 14:47:57.62
9 Joe Swank 2013-05-26 14:47:57.62
10 Christian Gable 2013-05-26 14:47:57.62

Unstructured Data

Unstructured data refers to information that does not follow a predefined model or tabular format. Unlike structured data, it has no fixed rows, columns, or consistent organization. Common examples include:

  • Emails
  • Text messages
  • Images and graphics
  • Audio and video files

Because it lacks a clear structure, unstructured data is more difficult to search, process, and analyze. Extracting useful information often requires significant manual effort or specialized tools. It can be compared to searching for every instance of a specific word in a book—possible, but time-consuming and inefficient without the right indexing or processing methods.

photos, chats, mp3…

Semi-Structured Data

Semi-structured data lies between structured and unstructured data. While it does not follow a rigid tabular format, it still maintains a discernible organizational framework, making it more flexible than strictly structured data.

This type of data can be analyzed effectively only when you understand the organizational rules embedded within it. A common example is HTML: although it is not stored in rows and columns, it contains predictable tags and nested elements that define the structure of the information. To locate specific content within an HTML file, you must first understand what each tag represents.

Semi-structured data therefore offers:

  • More flexibility than structured tables
  • More organization than raw unstructured content
  • Better adaptability for evolving or complex data formats
<user id=“1001”> “name”: “Maria” “roles”: [“admin”]

Types of Databases

As you learn about SQL, you will be working with a relational database, a common type of database that all analysts need to be familiar with. That said, relational databases are not only type of database out there. Let’s look at the most important ones and their defining features:

  1. Relational Databases
  2. NoSQL Databases
  3. Graph Databases
Important
  • Databases could also be centralized or distributed.
  • Databases could also be online analytical processing (OLAP) or online transaction processing (OLTP) system.

Relational Databases

A relational database is a collection of data stored in one or more tables of rows and columns.

Between the different tables, there are links, known as relationships. As you have already learned, data analysts use SQL to query and manipulate the data in relational databases.

Popular relational database management systems (RDBMS) include Microsoft SQL Server,MySQL, Oracle DB and PostgreSQL. Each of these RDBMS is owned by different companies:

  • PostgreSQL: PostgreSQL Community
  • MySQL: Oracle
  • Oracle: Oracle
  • MSSQL: Microsoft
Feature PostgreSQL SQL Server Oracle MySQL
License Open-source Commercial + Express Commercial Open-source (Oracle-owned)
OS Support Linux, macOS, Windows Windows, Linux Linux, Unix, Windows Linux, Windows
Strengths Standards compliant, JSONB, extensible Enterprise BI, T-SQL, Microsoft ecosystem Scalability, security, enterprise features Fast reads, simple web workloads
Best Use Cases Analytics, complex apps, data science BI, ERP, corporate apps Finance, telecom, large enterprises Web apps, lightweight services

NoSQL Databases

Unlike relational databases, NoSQL databases have a flexible data model, and they are used to store unstructured, semi-structured, and even structured data. In general, you wouldn’t use SQL to access a NoSQL database, though some have SQL-like languages.

As web applications have become more complex, NoSQL databases have become popular thanks to their flexibility. NoSQL databases are used for applications that change frequently as new functions and features are added; for instance, the social networking site LinkedIn uses a NoSQL database.

Thus, NoSQL databases are non-relational systems designed for scalability, flexibility, and high-volume data. They fall into several major categories depending on their data model.

Important

NoSQL stands for Not Only SQL and refers to database systems that are generally non-relational, designed for flexible schemas and distributed workloads.


Some NoSQL systems provide SQL-like query interfaces, but they do NOT follow the relational model

Key–Value Databases

  • Concept: Simple key → value storage.
  • Popular systems: Redis, Amazon DynamoDB, Riak KV
  • Example (Redis):
{
  'course':'Data Analytics',
  'company':'ACA'
}

Document Databases

  • Concept: JSON-like documents with flexible schema.
  • Popular systems: MongoDB, CouchDB, Firestore
  • Example (MongoDB):
{
  "user_id": 1001,
  "name": "Maria",
  "roles": ["admin", "manager"]
}

Column-Family (Wide-Column) Databases

  • Concept: Data stored in column families across distributed nodes.
  • Popular systems: Apache Cassandra, HBase, ScyllaDB
  • Example (Cassandra CQL):
INSERT INTO users (user_id, name, age)
VALUES (1001, 'Maria', 33);

Graph Databases

  • Concept: Nodes and relationships.
  • Popular systems: Neo4j, ArangoDB, Amazon Neptune
  • Example (Neo4j Cypher):
CREATE (u:User {name: "Maria"})
CREATE (p:Project {title: "Analytics App"})
CREATE (u)-[:LEADS]->(p);


Time-Series Databases

  • Concept: Optimized for timestamped metrics.
  • Popular systems: InfluxDB, TimescaleDB, Prometheus
  • Example (InfluxDB Line Protocol):

Mostly used among DevOps for monitoring


Search Engine Databases

  • Concept: Full-text indexing and search.
  • Popular systems: Elasticsearch, OpenSearch, Apache Solr
  • Example (Elasticsearch Query):
{
  "query": {
    "match": {
      "message": "database error"
    }
  }
}

Mostly used for querying logs


Summary Table

NoSQL Type Key Concept Popular Databases
Key–Value Key → Value lookups Redis, DynamoDB, Riak
Document JSON-like documents MongoDB, CouchDB, Firestore
Column-Family Wide-column storage Cassandra, HBase, ScyllaDB
Graph Nodes + edges Neo4j, ArangoDB, Amazon Neptune
Time-Series Time-stamped metrics InfluxDB, TimescaleDB, Prometheus
Search Engine Text search and indexing Elasticsearch, OpenSearch, Solr

OLAP Systems (Online Analytical Processing)

OLAP systems are optimized for READ-heavy analytical workloads. They store large volumes of historical data and support complex queries, aggregations, and trend analysis.

Key Features:

  • Designed primarily for reading and analyzing data.
  • Not optimized for frequent inserts/updates/deletes.
  • Stores historical and summarized information.
  • Supports advanced analytics: dashboards, reports, trends, forecasting.
  • Typically powered by data warehouses or data marts.

Example: An e-commerce analyst retrieves last year’s order data from a data warehouse and analyzes trends across months, categories, or customer segments.

Common Technologies: Amazon Redshift, Snowflake,BigQuery, Apache Hive, Microsoft SQL Server Analysis Services.


OLTP Systems (Online Transaction Processing)

OLTP systems are optimized for fast inserts, updates, and deletes. They handle high-volume transactional operations in real time.

Key Features:

  • Processes large numbers of small, atomic transactions.
  • Ensures consistency using ACID properties.
  • Optimized for real-time operations.
  • Used by operational systems rather than analytics.

Example: A bank’s ATM network continuously inserts, updates, or deletes records as customers make withdrawals, deposits, or transfers.

Common Technologies: PostgreSQL, MySQL, Oracle, SQL Server, MongoDB (as transactional store).

ImportantAbout ACID, BASE, and the CAP Theorem

ACID Transactions

In computer science, ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that guarantee the correctness of database transactions.

A transaction is treated as a single unit of work: - Either all operations succeed, or none do - The database remains in a valid state even in the presence of errors, crashes, or power failures

Example:
A bank transfer that debits one account and credits another must either complete fully or not happen at all.

ACID properties are commonly associated with relational databases and OLTP systems.


BASE Systems

BASE (Basically Available, Soft state, Eventually consistent) is an alternative design philosophy used mainly in distributed systems.

BASE systems: - Prioritize availability over immediate consistency - Allow temporary inconsistencies - Guarantee that data will become consistent eventually

This approach is common in large-scale, distributed NoSQL systems where uptime and scalability are more critical than strict transactional guarantees.


The CAP Theorem

The CAP theorem states that a distributed system cannot simultaneously guarantee all three of the following properties:

  • Consistency (C): every read returns the most recent write
  • Availability (A): every request receives a response
  • Partition Tolerance (P): the system continues operating despite network failures

In the presence of a network partition (which is unavoidable in distributed systems), a database must choose between Consistency and Availability.

As a result:

  • Some systems favor strong consistency and may reject requests during failures
  • Others favor high availability and may return temporarily inconsistent data

Important Clarification

The CAP theorem does NOT state that a system must be either ACID or BASE.

  • CAP describes trade-offs in distributed systems
  • ACID and BASE describe transactional and consistency models

Modern databases often implement hybrid approaches, providing:

  • ACID guarantees within limited scopes (e.g., single node, single partition)
  • Relaxed consistency across distributed components

Trade-offs always exist, but real-world systems rarely fit into pure categories.


OLAP vs OLTP: Summary

Feature OLAP OLTP
Primary purpose Analytics, reporting Real-time transactions
Query type Read-heavy, complex Write-heavy, simple
Data volume Historical, large datasets Current transactional data
Performance optimized for Aggregations, scans Fast inserts/updates/deletes
Common use cases Dashboards, BI, trend analysis Banking, retail transactions, booking systems
Users Analysts, data scientists End-users, applications, machines

NoteReading

OLTPvsOLAP

NoneReading

SQL vs NoSQL

Type of Storing Databases

The three kinds of databases discussed above can be centralized or distributed. Put simply, these concepts refer to where the database is stored.

Illustration of centralized vs distributed databases

Centralized Databases

A centralized database runs on a single machine (single-node). All users, analysts, and administrators connect to this one server.

Key Features:

  • All data lives in one powerful machine.
  • Easy to manage and maintain.
  • Can handle billions of records and terabytes of data.
  • Scaling requires buying larger, more expensive hardware (vertical scaling).
  • Once the system reaches hardware limits, performance and management become challenging.

Common in: Traditional enterprises, internal IT systems, on-premise setups.

Analogy: A workstation or laptop, but with high-end server specifications.


Distributed Databases

A distributed database stores data across multiple machines (nodes). These nodes work together to form one logical database.

Key Features:

  • Designed for massive-scale data.
  • Scales horizontally by adding more machines.
  • Faster data access and search due to parallel processing.
  • More cost-effective storage.
  • Higher availability and fault tolerance.

Common in:

  • Banking systems
  • Telecommunications
  • Large tech companies (Google, Facebook, Amazon, Netflix)
ImportantWhy it is used?

To handle constantly growing data, high query loads, global users, and real-time applications.

Centralized vs Distributed Databases: Summary

Feature Centralized Database Distributed Database
Storage location One machine Many machines
Scalability Vertical (add more power) Horizontal (add more nodes)
Typical capacity High but limited by hardware Extremely high, virtually unlimited
Cost Expensive to scale More cost-efficient

Data Management Architectures

Okay, we got high level understanding about the databases and their types, however in real life applications we need to master the so called Data Management Architecture (DMA). Let’s go over them chronologically:

  1. Data Warehouse (DW)
  2. Data Mart (DM)
  3. Data Lake (DL)
  4. Data Lakehouse (DLH)

Data Warehouse (DW)

The earliest large-scale analytics storage (1980-90s).

A Data Warehouse is a centralized, structured, and governed analytical storage designed for reporting, business intelligence, and decision-making. Everything inside is modeled, cleaned, validated, and optimized for SQL analytics.

Purpose

  • Provide a single version of truth for the organization.
  • Serve business reporting, dashboards, KPIs.
  • Allow analysts to run consistent, repeatable metrics.

Architecture

  • Core storage: relational tables in Star or Snowflake schema.
  • ETL/ELT pipelines transform raw data into structured models.
  • Strong governance: data quality rules, validation layers, semantic models.

Schema Approach

  • Schema-on-write: data must be modeled before it enters.
  • Facts (large numerical tables).
  • Dimensions (lookup tables with business attributes).
  • Conformed dimensions for uniform interpretation (e.g., Customer, Product).

Query Model

  • Aggregations, joins, slicing/dicing, reporting.
  • Optimized for read-heavy workloads.

Advantages

  • High data quality
  • Strong governance
  • Fast SQL analytics
  • Business-friendly, predictable models

Limitations

  • Rigid: changes require redesign
  • Poor for raw, unstructured data
  • Expensive for large-scale storage
  • Not ideal for ML pipelines that need raw historical data

Typical Tools/Platforms

  • Snowflake (warehouse mode)
  • Amazon Redshift
  • Google BigQuery (classic DW usage)
  • Teradata
  • Oracle Exadata

Relationship to others

  • DW is the parent of DM.
  • DW predates DL and DLH.

Data Mart (DM)

Introduced after Data Warehouses (late 1990s).

A Data Mart is a subject-specific, department-focused subset of a Data Warehouse. It is optimized for a particular business domain such as Marketing, Finance, Sales, HR, or Operations.

Purpose

  • Reduce load on the central Data Warehouse.
  • Deliver faster, domain-specific analytics.
  • Allow departments to have tailored KPIs and models.
  • Improve performance by minimizing unnecessary joins and tables.

Architecture

  • Usually built as dependent data marts, sourced from the DW.
  • Sometimes built as independent data marts (less common today).
  • Domain-specific schemas (e.g., Marketing Mart, Finance Mart).
  • Can use star/snowflake schemas but simplified for a business unit.

Schema Approach

  • Schema-on-write (same as DW).
  • Contains facts/dimensions relevant only to one business domain.
  • May extend or customize dimensions (e.g., Marketing_Customer).

Query Model

  • Faster analytical queries for a single domain.
  • Reduced table complexity.
  • Supports dashboards, KPI reporting, and departmental insights.

Advantages

  • High performance for domain-specific workloads
  • Reduces contention on the central DW
  • Simpler models for end users
  • Enables data ownership by individual departments

Limitations

  • Can create data silos if unmanaged
  • Requires synchronization with DW
  • Potential duplication of logic if marts diverge

Data Silo

Typical Tools/Platforms

  • Snowflake Data Marts
  • Redshift Data Marts
  • BigQuery departmental datasets
  • Microsoft SQL Server departmental warehouses

Relationship to others

  • DM is the child of DW.
  • Relies on DW for consistency.
  • Not directly related to Data Lakes or Lakehouses except in hybrid architectures.

Data Lake (DL)

Introduced during the Big Data era (2010s).

A Data Lake is a large-scale, low-cost, raw-data storage system that can ingest structured, semi-structured, and unstructured data without modeling.

Purpose

  • Store massive amounts of raw data cheaply.
  • Support machine learning and exploratory analytics.
  • Capture logs, JSON files, clickstreams, IoT data, images, audio.
  • Preserve granular, historical data with no loss.

Architecture

  • Built on object storage (S3, GCS, ADLS, or HDFS).
  • Organized into zones: raw, curated, prepared.
  • Ingestion via ELT pipelines.
  • Often uses frameworks like Spark, Hive, Presto, or Dremio.

Schema Approach

  • Schema-on-read: structure is applied only at query time.
  • Supports open formats: Parquet, ORC, Avro, CSV, JSON.
  • Can handle unstructured formats (text, logs, images).

Query Model

  • Flexible but slower than DW for SQL analytics.
  • Best for ML workloads requiring raw historical data.
  • Handles batch and streaming ingestion.

Advantages

  • Extremely flexible storage model
  • Very low cost for large-scale data
  • Excellent for ML, experimentation, and data science
  • Decouples storage from compute

Limitations

  • No ACID guarantees in traditional lakes
  • Poor governance (risk of “data swamp”)
  • Hard for business teams to use directly
  • Not optimized for BI dashboards

Typical Tools/Platforms

  • AWS S3 + Athena / EMR
  • Azure Data Lake Storage
  • Google Cloud Storage + BigQuery external tables
  • Hadoop HDFS
  • Databricks in lake mode

Relationship to others

  • DL evolved after DW and DM.
  • Predecessor of the DLH.
  • DW and DL often coexist in two-tier architectures.

Data Lakehouse (DLH)

The most modern architecture (2020s).

A Data Lakehouse is a unified, ACID-compliant, analytics + ML architecture built on top of Data Lake storage.
It merges the strengths of Data Warehouses and Data Lakes into a single system.

Purpose

  • Provide a single platform for SQL analytics, BI, ML, and streaming.
  • Avoid duplication between DW and DL.
  • Guarantee ACID transactions and strong governance on cheap lake storage.
  • Support both structured and unstructured data.

Architecture

  • Storage on cloud object stores (S3, ADLS, GCS).
  • ACID table formats: Delta Lake, Apache Iceberg, Apache Hudi.
  • Metadata layers for schema enforcement, governance, versioning.
  • Compute engines: Spark, Databricks, Presto/Trino, Snowflake, BigQuery.

Schema Approach

  • Supports both schema-on-write and schema-on-read.
  • Allows schema evolution, time travel, and versioning.
  • Enforces constraints even on raw lake storage.

Query Model

  • High-performance SQL (warehouse-like).
  • ML workloads (lake-like).
  • Supports batch, streaming, and real-time analytics.
  • Works for both BI dashboards and ML pipelines.

Advantages

  • Unifies DW + DL into one architecture
  • ACID guarantees on object storage
  • Suitable for both BI and ML
  • Simplifies pipelines and governance
  • Scalable and cost-effective

Limitations

  • Requires organizational adoption and re-skilling
  • Governance maturity varies by platform
  • Classic DW teams may struggle to transition

Typical Tools/Platforms

  • Databricks Lakehouse (Delta)
  • Snowflake (Unistore + Iceberg)
  • BigQuery (Lakehouse-friendly design)
  • Apache Iceberg + Trino/Presto
  • Apache Hudi platforms

Relationship to others

  • DLH is the evolution of the Data Lake.
  • DLH aims to unify and simplify analytics workloads.
  • DLH may replace DW in modern architectures.

**Database Layers

  • Operational Database (OLTP): Primary system of record for real-time transactions; normalized schema, ACID guarantees.
  • Data Warehouse: Curated, structured, and historical store optimized for analytics; fed via ETL/ELT pipelines and supports dimensional models (star/snowflake).
  • Data Lake: Low-cost storage for raw, semi-structured, and unstructured data at scale; schema-on-read enables flexible exploration.
  • Data Lakehouse: Combines lake storage with warehouse-like governance and performance (ACID tables, time travel, unified catalog).
  • Data Mart: Focused slice of the warehouse or lakehouse tailored for a specific team/domain to simplify access and controls.
  • Staging/Raw Layer: Landing zone for ingested data before cleansing and modeling; keeps reproducible snapshots of source data.

Intro to SQL

SQL (Structured Query Language) was first developed by IBM in the early 1970s and released publicly in 1979. It was later standardized by the American National Standards Institute (ANSI), establishing it as the primary language for interacting with relational databases.

SQL is used for three core tasks:

  • Manipulating Data: inserting, updating, or deleting records within a database.
  • Searching Data: retrieving all films that feature a specific actor (e.g., Ed Chase).
  • Defining Database Structures: For instance, creating new tables or adding columns to existing ones.

Like any language, SQL follows specific syntax rules.

  • An English sentence ends with a period.
  • A SQL statement ends with a semicolon (;).
Warning

While the foundational syntax is consistent, slight variations may occur depending on the database system (e.g., PostgreSQL vs. SQL Server). Understanding these conventions is essential before writing your own SQL queries.


ImportantIs SQL a Programing Language?

SQL generally counts as code even though it is NOT technically a programming language.

You willl encounter a common phenomenon when you start working with it: finding yourself in situations where you cannot solve a coding problem using only your existing knowledge. Do not worry, as this happens to all beginners, and it is something you will have to learn to deal with.

Most coders spend a fair amount of time researching answers to their problems online. There is no need to get frustrated if you run into a brick wall—simply Google what you are trying to achieve in SQL (or any other language), and you may be surprised how quickly you will find an answer to your problem!

Building our first Database

Now that we have high level understanding about the databases, their types and architectures, it is high time to build our first database and continue.

In order to be able to build and run the database you need to keep docker desktop active. If you have note installed Docker yet, check out the installation and testing steps here PS you will see how easy it is.

What are we going to achieve?

We are going to build a relational database which will consist of the following tables.

  1. Sales
  2. Time Dimension
  3. Product Dimension
  4. Employee Dimension
  5. Customer Dimension

Important

This is going to be part of your portfolio, make sure the make it as structured as possible.

Step 1: Create New GitHub Repository

  1. Log in to GitHub.
  2. Click New Repository.
  3. Repository name: sql-analytics-portfolio
  4. Check Add a README file.
  5. Check Add .gitignore and choose None.
  6. Click Create Repository.

Step 2: Clone the Repository

  1. Create a folder named aca on your Desktop or wherever you want
  2. Open your terminal there. You should see ~/aca
  3. Type in your terminal:
git clone <your-repository-url>
  1. Navigate to that folder by typing:
cd  <your-repository-url>
  1. type code . to open the project with VS Code
  2. RECOMMENDED save the project in VS Code using the Project Manager extension

Step 3: Download the data

  1. Create a new folder inside of the project named data
  2. Download the following csv files and put in the data folder
  3. customers.csv
  4. employees.csv
  5. orders.csv
  6. products.csv
  7. sales.csv
  8. Add and commit the changes done so far
git add .
git commit -m "Added data folder with CSV datasets"

Step 4: Update .gitignore

  1. Open the .gitignore file in your repository and add the following two lines at the bottom:
pgadmin_data/
postgres_data/
  1. Again add and commit
git add .gitignore
git commit -m "Updated .gitignore to exclude database volumes"

Step 5: Creating Init folder and SQL files

We are going to create init files which will be executed only during the first db initialization.

  1. Create a folder named init
  2. Inside of the init folder create a file named 01_schema.sql for database schema code. Copy the respective chunk there
  3. Inside of the init folder create a file named 02_etl.sql for data loading to SQL
  4. Add and Commit the changes
  git add init/
  git commit -m "Added initial schema and ETL SQL files"

01_schema.sql

-- 01_schema.sql

-- Safety: drop if you are iterating (comment these in production)
-- DROP TABLE IF EXISTS sales CASCADE;
-- DROP TABLE IF EXISTS orders CASCADE;
-- DROP TABLE IF EXISTS products CASCADE;
-- DROP TABLE IF EXISTS customers CASCADE;
-- DROP TABLE IF EXISTS employees CASCADE;

CREATE TABLE IF NOT EXISTS employees (
  employee_id   SERIAL PRIMARY KEY,
  first_name    TEXT,
  last_name     TEXT,
  email         TEXT,
  salary        NUMERIC
);

CREATE TABLE IF NOT EXISTS customers (
  customer_id   INTEGER PRIMARY KEY,
  customer_name TEXT,
  address       TEXT,
  city          TEXT,
  zip_code      TEXT
);

CREATE TABLE IF NOT EXISTS products (
  product_id    INTEGER PRIMARY KEY,
  product_name  TEXT,
  price         NUMERIC,
  description   TEXT,
  category      TEXT
);

-- orders: include year/quarter/month as stored columns (loaded from CSV)
CREATE TABLE IF NOT EXISTS orders (
  order_id    INTEGER PRIMARY KEY,
  order_date  TIMESTAMP,
  year        INT,
  quarter     INT,
  month       TEXT
);

CREATE TABLE IF NOT EXISTS sales (
  transaction_id INTEGER PRIMARY KEY,
  order_id       INTEGER REFERENCES orders(order_id)     ON DELETE RESTRICT,
  product_id     INTEGER REFERENCES products(product_id) ON DELETE RESTRICT,
  customer_id    INTEGER REFERENCES customers(customer_id) ON DELETE RESTRICT,
  employee_id    INTEGER REFERENCES employees(employee_id) ON DELETE SET NULL,
  total_sales    NUMERIC,
  quantity       INTEGER,
  discount       NUMERIC
);

-- Helpful indexes
CREATE INDEX IF NOT EXISTS idx_sales_order_id   ON sales(order_id);
CREATE INDEX IF NOT EXISTS idx_sales_product_id ON sales(product_id);
CREATE INDEX IF NOT EXISTS idx_sales_customer_id ON sales(customer_id);
CREATE INDEX IF NOT EXISTS idx_orders_date      ON orders(order_date);

02_etl.sql

-- COPY must read files inside the container; we mounted ./data to /docker-entrypoint-initdb.d/data
\echo 'Loading employees...'
COPY employees(employee_id,first_name,last_name,email,salary)
FROM '/docker-entrypoint-initdb.d/data/employees.csv'
WITH (FORMAT csv, HEADER true);

\echo 'Loading customers...'
COPY customers(customer_id,customer_name,address,city,zip_code)
FROM '/docker-entrypoint-initdb.d/data/customers.csv'
WITH (FORMAT csv, HEADER true);

\echo 'Loading products...'
COPY products(product_id,product_name,price,description,category)
FROM '/docker-entrypoint-initdb.d/data/products.csv'
WITH (FORMAT csv, HEADER true);

\echo 'Loading orders...'
COPY orders(order_id,order_date,year,quarter,month)
FROM '/docker-entrypoint-initdb.d/data/orders.csv'
WITH (FORMAT csv, HEADER true);

\echo 'Loading sales...'
COPY sales(transaction_id,order_id,product_id,customer_id,employee_id,total_sales,quantity,discount)
FROM '/docker-entrypoint-initdb.d/data/sales.csv'
WITH (FORMAT csv, HEADER true);
Caution

Pay attention to the orders and think about it.

Step 6: Create .env file

This file is going to setup the database parameters

  1. Create .env file in the root
  2. Copy and paste the following there
PORT=5432
DB_USER=admin
DB_PASSWORD=password
DB_NAME=aca
PGADMIN_EMAIL=admin@admin.com
PGADMIN_PASSWORD=admin
  1. Add and commit .env file

git add .env
git commit -m "Added environment variables for PostgreSQL"

Step 7: Create the docker-compose.yaml File

  1. Create docker-compose.yaml file
  2. copy and paste the following code there
services:
  db:
    container_name: postgresql_db_aca
    image: postgres:17
    restart: always
    env_file: .env
    environment:
      POSTGRES_USER: ${DB_USER}
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_DB: ${DB_NAME}
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${DB_USER} -d ${DB_NAME}"]
      interval: 10s
      timeout: 5s
      retries: 10
    volumes:
      # Persist database data (keep your bind mount; a named volume also works)
      - ./postgres_data:/var/lib/postgresql/data
      # Run these once on first initialization
      - ./init:/docker-entrypoint-initdb.d
      # Make CSVs available to COPY during init (read-only)
      - ./data:/docker-entrypoint-initdb.d/data:ro

  pgadmin:
    container_name: pgadmin_aca
    image: dpage/pgadmin4
    restart: always
    env_file: .env
    environment:
      PGADMIN_DEFAULT_EMAIL: ${PGADMIN_EMAIL}
      PGADMIN_DEFAULT_PASSWORD: ${PGADMIN_PASSWORD}
    ports:
      - "5050:80"
    volumes:
      - ./pgadmin_data:/var/lib/pgadmin
      - ./queries:/var/lib/pgadmin/storage/admin_admin.com/queries
    depends_on:
      - db
  1. Add and commit

git add docker-compose.yaml
git commit -m "Added Docker Compose configuration for PostgreSQL and pgAdmin"
  1. Now your repository must have the following structure
.
├── README.md
├── .gitignore                     # Ignoring
├── .env                     # Environment variables for PostgreSQL
├── docker-compose.yaml      # Docker Compose configuration
├── data/                    # CSV datasets
│   ├── customers.csv
│   ├── employees.csv
│   ├── orders.csv
│   ├── products.csv
│   ├── sales.csv
├── init/                    # SQL initialization scripts
│   ├── 01_schema.sql        # Database schema creation
│   ├── 02_etl.sql           # Data loading and ETL process

Step 8: Run the Docker Environment

  1. Open the integrated terminal in VS Code.
  2. Ensure Docker Desktop is running on your machine.
  3. Then start the database services:
docker compose up

Step 9: View the Tables

  1. Once running, access pgAdmin in your browser by http://localhost:5050
  2. username: admin@admin.com
  3. password: admin

  1. Open Object\(\rightarrow\)Register\(\rightarrow\)Server

  2. In the general under the name column

  3. Click on the connection and fill in:

    1. hostname must be the same as the postgres service name in docker-compose.yaml, which is db,
    2. port 5432
    3. maintenance database: aca
    4. username: admin
    5. password: password
  4. Navigate to in my case: \(any_name \rightarrow aca \rightarrow Schemas \rightarrow public \rightarrow Tables\)

  5. Click on the query tool copy and paste the queries provided in the Step 10

Step 10: Our First Queries | Preview

Use the following SQL statements to preview the first 10 rows from each table after loading your CSV data.

A schema is a logical container inside a PostgreSQL database. You can think of it as a folder that holds database objects.

The default schema is public.

SELECT * FROM public.customers LIMIT 10;

SELECT * FROM public.employees LIMIT 10;

SELECT * FROM public.orders LIMIT 10;

SELECT * FROM public.products LIMIT 10;

SELECT * FROM public.sales LIMIT 10;

Step 11: Close and Delete Containers

As soon as you finish the job related to the DB, remove the containers by typing

docker-compose down

Step 12: Every time you re-run

We are going to deactivate and activate constantly, so you will get used to it. Don’t worry :)

docker-compose up
Important

In case of ANY CHANGE in the docker-compose.yaml file:

  1. docker compose down -v
  2. delete postgres_data folder
  3. docker compose up --build

STEP 13: Detached Mode

Once we have databes working properly instead of docker compose up you can use the following command to run the containers in detached mode:

docker compose up -d