

Session 06: DA with SQL | String Functions


Text Functions | General Overview

Text (string) functions are used to inspect, clean, standardize, and transform textual data at the row level.

Unlike numeric aggregate functions (SUM, AVG, COUNT), most text functions:

  • operate per row
  • return derived text columns
  • do not reduce the number of rows on their own

In analytical workflows, text functions are typically applied before aggregation to ensure that grouping, counting, and summarization behave correctly.

Running Docker

Let's start the database container with Docker Compose in detached mode:

docker compose up -d

Why Text Functions Matter in Analytics

Real-world text data is often dirty, inconsistent, and ambiguous.

Common issues include:

  • inconsistent casing (Electronics vs electronics)
  • leading or trailing spaces
  • annotations inside text (Category (Promo))
  • mixed formats (ID-123, ID_123, 123)
  • missing or partially filled values

If these issues are not handled explicitly:

  • GROUP BY produces fragmented categories
  • COUNT(DISTINCT ...) overcounts entities
  • KPIs become inconsistent across dashboards
  • joins silently fail

Text functions are therefore data quality tools, not cosmetic helpers.


Typical Analytical Use Cases

Text functions are commonly used for:

  • data cleaning
    • trimming spaces
    • normalizing case
    • removing annotations
  • standardization
    • making categories comparable
    • ensuring consistent grouping keys
  • feature engineering
    • extracting parts of strings
    • building derived text features
  • validation
    • detecting malformed values
    • identifying unexpected formats

Analytical Principle

Text transformations change how rows are grouped.

Because of this:

  • text cleaning is an analytical decision
  • not a purely technical step
  • and must be done intentionally and consistently

Dummy Setup | Raw Phone Numbers and Names

To demonstrate text normalization beyond phone numbers, let's create a dummy table that covers the entire spectrum of issues.

This allows us to later apply functions such as UPPER(), LOWER(), INITCAP(), and validation logic on names.

DROP TABLE IF EXISTS customers_raw_text;

CREATE TABLE customers_raw_text (
  customer_id   INTEGER,
  first_name    TEXT,
  last_name     TEXT,
  raw_phone     TEXT,
  category_raw  TEXT,
  birth_date    DATE
);

INSERT INTO customers_raw_text (
  customer_id,
  first_name,
  last_name,
  raw_phone,
  category_raw,
  birth_date
) VALUES
  (1, 'joHN',     'doE',        '   077600945  ',   'Accessories (Promo)', DATE '1994-03-12'),
  (2, 'MARY',     'sMiTh',      '077-600-045',      'Electronics (Old)',   DATE '1988-11-05'),
  (3, 'aLEx',     'johnSON',    '(374)-77-600-945', 'Accessories',         DATE '2001-07-23'),
  (4, 'anna',     'VAN DYKE',   '37477600945',      'Electronics (Promo)', DATE '1999-01-30'),
  (5, NULL,       'brOwn',      '77600945',         'Accessories (Test)',  DATE '1994-03-12');

What this setup gives us

  • phone numbers with inconsistent formatting
  • names with:
    • random capitalization
    • mixed casing within the same value
    • multi-part last names
  • unstructured category labels containing annotations

This mirrors real-world dirty text data and prepares us for the next set of string functions focused on standardization and normalization.
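As an optional sanity check (a hypothetical profiling step, not part of the original setup), we can profile the raw table before any cleaning:

```sql
-- Quick profile of the raw table: row count, distinct raw values,
-- and missing names. Runs against customers_raw_text created above.
SELECT
  COUNT(*)                                   AS total_rows,
  COUNT(DISTINCT raw_phone)                  AS distinct_raw_phones,
  COUNT(DISTINCT category_raw)               AS distinct_raw_categories,
  COUNT(*) FILTER (WHERE first_name IS NULL) AS missing_first_names
FROM customers_raw_text;
```

Distinct counts computed over raw text overstate the number of real entities, which is exactly the problem the following functions address.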

LENGTH()

Text fields such as phone numbers often arrive in many inconsistent formats.
Before attempting to clean or standardize them, a good analyst first tries to detect which rows are malformed.

The LENGTH() function is a simple but powerful diagnostic tool for this purpose.

Inspecting the Lengths

The first question to ask is:

Do all values have the same length?

SELECT
  raw_phone,
  LENGTH(raw_phone) AS phone_length
FROM customers_raw_text;

Output

| raw_phone | phone_length |
|---|---|
| '   077600945  ' | 14 |
| '077-600-045' | 11 |
| '(374)-77-600-945' | 16 |
| '37477600945' | 11 |
| '77600945' | 8 |

Interpretation

Even before cleaning, we can already classify rows:

  • 8 characters: likely already standardized → 77600945
  • 9 characters (e.g. after trimming): a correct number with a leading zero → 077600945
  • 11 characters: likely contains a country code or separators
  • 14+ characters: clearly contains whitespace and formatting symbols

At this point we can already suspect several distinct sources of noise:

  • spaces
  • hyphens
  • parentheses
  • country prefixes

Why This Step Is Critical

By using LENGTH() before cleaning, you can:

  • detect formatting issues early
  • identify multiple data quality patterns
  • avoid applying a one-size-fits-all transformation
  • design targeted cleaning rules

This avoids silent errors later when:

  • grouping
  • joining
  • deduplicating

Analytical Insight

Notice that:

  • LENGTH() alone already tells us which rows cannot possibly match the target
  • the target 77600945 has a clear expected length

This gives us a validation rule we can enforce after cleaning.
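One way to turn that expectation into an explicit rule is a length-based flag. A minimal sketch, assuming the 8-digit target length from this example:

```sql
-- Flag rows whose trimmed length deviates from the expected 8 digits.
-- The expected length of 8 is an assumption specific to this example.
SELECT
  raw_phone,
  LENGTH(TRIM(raw_phone)) AS trimmed_length,
  CASE
    WHEN LENGTH(TRIM(raw_phone)) = 8 THEN 'likely clean'
    ELSE 'needs cleaning'
  END AS length_check
FROM customers_raw_text;
```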

TRIM()

Text fields often contain invisible formatting noise, such as leading or trailing spaces.
Although these spaces are not obvious when visually inspecting data, they can:

  • break joins
  • fragment groups
  • cause false duplicates
  • invalidate validation rules

The TRIM() function is used to remove unnecessary whitespace from the beginning and end of text values.

Importantly, TRIM() removes noise, not meaning.


Why TRIM() Matters in Analytics

Whitespace has no business semantics.

That means:

  • removing it does not change the underlying value
  • but failing to remove it can distort analytical results

Because of this, TRIM() is considered a safe and low-risk transformation and is usually applied early in data-cleaning pipelines.


Applying TRIM()

Below we apply TRIM() to raw phone numbers and compare lengths before and after trimming.

SELECT
  raw_phone,
  LENGTH(raw_phone) AS length,
  TRIM(raw_phone) AS trimmed_phone,
  LENGTH(TRIM(raw_phone)) AS trimmed_length
FROM customers_raw_text;

Result Interpretation

From the output, we observe that:

  • leading and trailing spaces are removed
  • internal characters (digits, hyphens, parentheses) remain unchanged
  • some rows move closer to the expected target length
  • others still violate formatting rules

This confirms that:

  • whitespace was only one of several issues
  • further transformations are required for full standardization

Directional Variants

SQL also provides directional trimming functions:

LTRIM(raw_phone)   -- removes leading spaces only
RTRIM(raw_phone)   -- removes trailing spaces only


Using TRIM() is equivalent to applying both.


Analytical Best Practice

Always follow this pattern:

  1. Measure (e.g., LENGTH())
  2. Transform (e.g., TRIM())
  3. Re-measure to validate impact

This makes transformations:

  • transparent
  • auditable
  • analytically defensible
Important
  • TRIM() removes invisible formatting noise
  • it does not alter business meaning
  • it prepares text data for structural cleaning
  • it should almost always be applied before more aggressive text transformations

Whitespace issues are subtle, but ignoring them leads to silent analytical errors.

LOWER(), UPPER(), INITCAP()

Text values often differ only by capitalization, even though they represent the same entity.

For example:

  • john, John, JOHN
  • van dyke, Van Dyke, VAN DYKE

From an analytical perspective, these differences are purely cosmetic, but if left untreated they lead to:

  • fragmented groups
  • inflated COUNT(DISTINCT ...)
  • failed joins
  • inconsistent reporting

SQL provides several functions to normalize text case.


Why Case Normalization Matters

Capitalization has no business meaning, but SQL treats differently cased strings as different values.

That means:

  • 'john' ≠ 'John' ≠ 'JOHN'
  • grouping becomes unreliable
  • deduplication becomes inaccurate

Case normalization is therefore a data quality operation, not a formatting preference.


LOWER() | Normalize to Lowercase

LOWER() converts all characters in a string to lowercase.

This is the most common choice for:

  • grouping keys
  • joins
  • deduplication
  • categorical dimensions
SELECT
  first_name,
  LOWER(first_name) AS first_name_lower
FROM customers_raw_text;

When to Use

  • internal identifiers
  • grouping and joins
  • text comparison logic

UPPER() | Normalize to Uppercase

UPPER() converts all characters in a string to uppercase.

This is often used for:

  • codes
  • abbreviations
  • country or region identifiers
SELECT
  last_name,
  UPPER(last_name) AS last_name_upper
FROM customers_raw_text;

When to Use

  • standardized codes
  • reporting conventions
  • legacy system alignment

INITCAP() | Proper Name Formatting

INITCAP() converts text to title case:

  • first letter uppercase
  • remaining letters lowercase
SELECT
  first_name,
  INITCAP(first_name) AS first_name_clean
FROM customers_raw_text;

Example Transformations

| raw value | INITCAP result |
|---|---|
| john | John |
| mARy | Mary |
| van dyke | Van Dyke |

Analytical Comparison

| Function | Best Use Case |
|---|---|
| LOWER() | joins, grouping, deduplication |
| UPPER() | codes, abbreviations |
| INITCAP() | presentation, names |

Analytical Best Practice

A common and robust pattern is:

  • use LOWER() for keys and joins
  • use INITCAP() for display purposes
  • never mix raw and normalized text in analysis

Case normalization:

  • changes how rows group
  • affects counts and uniqueness
  • must be applied consistently

Like TRIM(), these functions remove noise, not meaning.
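To see the grouping impact directly, one possible diagnostic (a sketch, not from the session materials) compares distinct counts before and after normalization:

```sql
-- Distinct last names before vs after case normalization.
-- If the two counts differ, raw casing is fragmenting the dimension.
SELECT
  COUNT(DISTINCT last_name)        AS distinct_raw,
  COUNT(DISTINCT LOWER(last_name)) AS distinct_normalized
FROM customers_raw_text;
```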

REPLACE()

After handling whitespace (TRIM()) and capitalization (LOWER(), UPPER(), INITCAP()), the next common problem in real-world text data is structural noise inside strings.

Structural noise refers to characters that are part of visual formatting but carry no analytical meaning.

Typical examples include:

  • hyphens in phone numbers
  • spaces used as separators
  • parentheses around country codes
  • dots, slashes, or underscores

If these characters are not removed, text values that represent the same entity remain incomparable.


Why REPLACE() Matters in Analytics

REPLACE() allows you to explicitly remove or substitute known characters.

From an analytical perspective:

  • the rule is deterministic
  • the transformation is transparent
  • the outcome is fully explainable

This makes REPLACE() suitable for early-stage structural cleanup, before applying more advanced pattern-based logic.


Basic Usage

The REPLACE() function has the following structure:

REPLACE(text, old_value, new_value)

It replaces all occurrences of old_value with new_value.


Removing Known Separators

Phone numbers often include hyphens for readability.

SELECT
  raw_phone,
  REPLACE(raw_phone, '-', '') AS phone_no_hyphen
FROM customers_raw_text;

| raw_phone | phone_no_hyphen |
|---|---|
| 077-600-045 | 077600045 |
| (374)-77-600-945 | (374)77600945 |

Only the hyphen is removed.
All other characters remain untouched.


Chaining REPLACE() Calls

When multiple unwanted characters are present, REPLACE() calls can be nested.

SELECT
  raw_phone,
  REPLACE(
    REPLACE(
      REPLACE(TRIM(raw_phone), '-', ''),
    '(', ''),
  ')', '') AS phone_partial_clean
FROM customers_raw_text;

This removes:

  • leading and trailing spaces
  • hyphens
  • parentheses

Resulting Pattern

| raw_phone | phone_partial_clean |
|---|---|
| (374)-77-600-945 | 37477600945 |
| 077-600-045 | 077600045 |

At this stage, formatting noise is reduced, but the value is still not standardized.


Analytical Limitations of REPLACE()

While REPLACE() is useful, it has important limitations:

  • it works only on explicit characters
  • it does not understand patterns
  • it cannot validate correctness
  • rules must be hard-coded

For example:

  • it cannot remove “everything that is not a digit”
  • it cannot enforce length rules
  • it cannot adapt to new formats

Because of this, REPLACE() is rarely the final cleaning step.


When to Use REPLACE()

Use REPLACE() when:

  • the unwanted character is known
  • formatting rules are simple
  • transformations must be transparent

Avoid relying on REPLACE() when:

  • formats vary widely
  • patterns are unknown
  • validation is required

Warning | Hard-Coded Assumptions

Every REPLACE() call encodes an assumption:

  • which characters are noise
  • which characters are meaningful
  • what happens if formats change

These assumptions must be:

  • documented
  • reviewed
  • justified by business logic

REGEXP_REPLACE()

After handling explicit characters with REPLACE(), we move to pattern-based cleaning.

Real-world text often contains unknown or variable noise.
In such cases, hard-coded replacements do not scale.

REGEXP_REPLACE() allows you to define rules, not characters.


Why REGEXP_REPLACE() Matters in Analytics

Regular expressions let you describe what to keep or what to remove, rather than listing every unwanted symbol.

From an analytical perspective, this means:

  • fewer hard-coded assumptions
  • better generalization to unseen formats
  • cleaner, more robust standardization rules

This is especially important for identifiers such as:

  • phone numbers
  • IDs
  • codes
  • reference keys

REGEXP_REPLACE() | Syntax Overview

REGEXP_REPLACE(text, pattern, replacement [, flags])

Key components:

  • text → the input string
  • pattern → a regular expression
  • replacement → what to substitute
  • flags → optional modifiers (e.g. global replacement)

Extracting Digits Only

A very common analytical task is to remove everything except digits.

The regex pattern:

[^0-9]

Pattern breakdown:

  • [] → character set
  • ^ → negation (NOT, when placed first inside the brackets)
  • 0-9 → the digit range

So the pattern matches any non-digit character.


Query | Digits-Only Cleaning

Let's run the code below:

SELECT
  raw_phone,
  REGEXP_REPLACE(raw_phone, '[^0-9]', '', 'g') AS digits_only
FROM customers_raw_text;

  • '[^0-9]': the regular expression pattern defining what to match
  • '': the replacement (an empty string, i.e. delete the match)
  • 'g': the global flag — without 'g', only the first match would be replaced
Note | Experiment

Feel free to experiment as much as you need!

Conceptual Output

The above query returns:

| raw_phone | digits_only |
|---|---|
| '   077600945  ' | 077600945 |
| '077-600-045' | 077600045 |
| '(374)-77-600-945' | 37477600945 |
| '37477600945' | 37477600945 |
| '77600945' | 77600945 |


  • all formatting symbols are removed
  • only numeric content remains
  • lengths still vary
Important | Primary Goal

The goal here is not to memorize regex syntax, but to recognize reusable analytical patterns.

Removing Annotations Inside Parentheses

Text fields often contain annotations or comments that should not be part of the analytical key

Recall our dummy table (name columns omitted):

| customer_id | raw_phone | category_raw | birth_date |
|---|---|---|---|
| 1 | '   077600945  ' | Accessories (Promo) | 1994-03-12 |
| 2 | 077-600-045 | Electronics (Old) | 1988-11-05 |
| 3 | (374)-77-600-945 | Accessories | 2001-07-23 |
| 4 | 37477600945 | Electronics (Promo) | 1999-01-30 |
| 5 | 77600945 | Accessories (Test) | 1994-03-12 |

Examples:

  • Accessories (Promo)
  • Electronics (Old)
  • Gold Customer (Test)

From an analytical standpoint, these annotations:

  • fragment categories
  • break grouping logic
  • inflate distinct counts

Regex Pattern 1 | Remove Everything Inside ()

The following pattern removes:

  • the opening parenthesis (
  • everything inside
  • the closing parenthesis )
\([^)]*\)

Pattern breakdown:

  • \( → literal opening parenthesis
  • [^)]* → any characters except ) (zero or more)
  • \) → literal closing parenthesis
Query | Cleaning Categories
SELECT
    category_raw,
    REGEXP_REPLACE(category_raw, '\([^)]*\)', '', 'g')
   AS category_clean1
FROM customers_raw_text;
Output | Cleaning Categories

| category_raw | category_clean1 |
|---|---|
| Accessories (Promo) | 'Accessories ' |
| Electronics (Old) | 'Electronics ' |
| Accessories | 'Accessories' |
| Electronics (Promo) | 'Electronics ' |
| Accessories (Test) | 'Accessories ' |

Note the trailing space left behind where each annotation was removed.
Caution | Whitespaces

Check the whitespaces!

Regex Pattern 2 | Remove Parentheses Content + TRIM Whitespaces

After removing annotations inside parentheses, trailing whitespaces often remain.
This leads to categories that look identical but fail equality checks.

To fix this, we chain TRIM() after REGEXP_REPLACE().

TRIM(REGEXP_REPLACE(category_raw, '\([^)]*\)', '', 'g'))

What this does:

  • removes everything inside ()
  • removes the parentheses themselves
  • trims leading and trailing spaces

Query | Cleaning Categories + TRIM
SELECT
    category_raw,
    TRIM(
      REGEXP_REPLACE(category_raw, '\([^)]*\)', '', 'g')
    ) AS category_clean2
FROM customers_raw_text;

Output | Cleaning Categories (After TRIM)
| category_raw | category_clean2 |
|---|---|
| Accessories (Promo) | Accessories |
| Electronics (Old) | Electronics |
| Accessories | Accessories |
| Electronics (Promo) | Electronics |
| Accessories (Test) | Accessories |

Note | Why TRIM Matters

Without TRIM(), values like 'Accessories ' and 'Accessories'
would be treated as different categories during grouping.


Regex Pattern 3 | Remove Parentheses + Leading Space in One Step

Instead of chaining functions, we can encode whitespace logic directly into the regex.

\s*\(.*?\)

Pattern breakdown:

  • \s* → zero or more whitespace characters
  • \( → opening parenthesis
  • .*? → any characters (non-greedy)
  • \) → closing parenthesis

This pattern removes:

  • the parentheses
  • the annotation
  • any space before the parenthesis

Query | Single-Step Category Cleaning
SELECT
    category_raw,
    REGEXP_REPLACE(category_raw, '\s*\(.*?\)', '', 'g')
      AS category_clean3
FROM customers_raw_text;

Output | Cleaning Categories (Single Regex)
| category_raw | category_clean3 |
|---|---|
| Accessories (Promo) | Accessories |
| Electronics (Old) | Electronics |
| Accessories | Accessories |
| Electronics (Promo) | Electronics |
| Accessories (Test) | Accessories |

Comparison Summary

| Approach | Result | Notes |
|---|---|---|
| Pattern 1 | 'Accessories ' | Leaves trailing spaces |
| Pattern 2 | 'Accessories' | Safe and explicit |
| Pattern 3 | 'Accessories' | Compact, regex-heavy |

Warning | Analytical Guidance

  • Start with Pattern 1 when learning
  • Prefer Pattern 2 when teaching or collaborating
  • Prefer Pattern 3 when rules are stable and well-documented
  • Always validate results with COUNT(DISTINCT ...) after cleaning

Regex-based cleaning directly impacts grouping, KPIs, and dashboards.
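A minimal validation query in that spirit (a sketch reusing the Pattern 2 cleaning rule from above):

```sql
-- Compare category cardinality before and after cleaning.
-- On the dummy table, five raw variants collapse to two clean categories.
SELECT
  COUNT(DISTINCT category_raw) AS distinct_raw_categories,
  COUNT(DISTINCT TRIM(REGEXP_REPLACE(category_raw, '\([^)]*\)', '', 'g')))
    AS distinct_clean_categories
FROM customers_raw_text;
```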

REGEXP vs REPLACE | When to Choose Which

| Scenario | Prefer |
|---|---|
| Known characters only | REPLACE() |
| Variable formatting | REGEXP_REPLACE() |
| Validation and extraction | REGEXP_REPLACE() |
| Transparency and simplicity | REPLACE() |

Analytical Warning

Warning | Regex Is Powerful but…

Regular expressions are powerful but dangerous if misused.

  • overly broad patterns can remove valid information
  • unreadable regex becomes technical debt
  • rules must be documented and tested

Always combine regex cleaning with measurement and validation.

SUBSTRING()

After cleaning whitespace, case, and structural noise, the next common analytical task is extracting meaningful parts of text.

SUBSTRING() allows you to select a specific portion of a string based on position or pattern.
This is essential when a single text field contains multiple pieces of information.

Unlike REPLACE() or REGEXP_REPLACE(), which remove noise,
SUBSTRING() is used to isolate signal.


Why SUBSTRING() Matters in Analytics

Many real-world text fields are compound fields, for example:

  • phone numbers with prefixes
  • codes with embedded meaning
  • identifiers with fixed segments
  • dates or versions encoded in strings

From an analytical perspective, this means:

  • grouping requires extracted keys
  • joins depend on consistent substrings
  • validation depends on expected positions

SUBSTRING() | Basic Positional Extraction

The most common form of SUBSTRING() uses starting position and length.

SUBSTRING(text FROM start_position FOR length)

Example 1 | Extract Last 8 Digits of Phone Numbers

Assume we have already removed non-digit characters and want to extract the core phone number.

Input (conceptual — digits-only values after cleaning)

| raw_phone |
|---|
| 077600945 |
| 37477600945 |
| 77600945 |

SELECT
  raw_phone,
  SUBSTRING(raw_phone FROM LENGTH(raw_phone) - 7 FOR 8) AS phone_core
FROM customers_raw_text;

Output

| raw_phone | phone_core |
|---|---|
| 077600945 | 77600945 |
| 37477600945 | 77600945 |
| 77600945 | 77600945 |

Analytical Interpretation

This logic:

  • assumes the last 8 digits uniquely identify the phone number
  • removes country codes and prefixes
  • creates a standardized join key

This pattern is extremely common in:

  • telecom analytics
  • customer deduplication
  • identity resolution

Example 2 | SUBSTRING() with Fixed Structure

When text follows a known fixed format, SUBSTRING() is deterministic and fast.

SELECT
  category_raw,
  SUBSTRING(category_raw FROM 1 FOR 11) AS category_prefix
FROM customers_raw_text;

Output

| category_raw | category_prefix |
|---|---|
| Accessories (Promo) | Accessories |
| Electronics (Old) | Electronics |
| Accessories | Accessories |
| Electronics (Promo) | Electronics |
| Accessories (Test) | Accessories |

This is useful when:

  • naming conventions are enforced
  • formats are guaranteed upstream

Example 3 | Extract Digits Using Regex

PostgreSQL also supports regex-based substring extraction.

SUBSTRING(text FROM pattern)

SELECT
  raw_phone,
  SUBSTRING(raw_phone FROM '[0-9]+') AS first_digit_sequence
FROM customers_raw_text;

This extracts the first continuous digit block.

| raw_phone | first_digit_sequence |
|---|---|
| '   077600945  ' | 077600945 |
| '077-600-045' | 077 |
| '(374)-77-600-945' | 374 |
| '37477600945' | 37477600945 |
| '77600945' | 77600945 |

Note that [0-9]+ is greedy: it matches the entire first contiguous run of digits, so an unbroken number is returned in full.

Positional vs Pattern-Based SUBSTRING()

| Use Case | Prefer |
|---|---|
| Fixed-length identifiers | Positional |
| Variable formats | Regex-based |
| Performance-critical logic | Positional |
| Unknown structure | Regex-based |

Analytical Best Practice

  • use positional SUBSTRING() when formats are stable
  • use regex SUBSTRING() when formats vary
  • always validate results with:
    • LENGTH()
    • COUNT(DISTINCT ...)
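Such a validation might look like the following sketch (it assumes the digits-only cleaning from the regex section; the subquery alias is illustrative):

```sql
-- Validate that every extracted phone core has the expected 8 digits.
-- Any group with a length other than 8 signals a broken assumption.
SELECT
  LENGTH(phone_core) AS core_length,
  COUNT(*)           AS rows_with_this_length
FROM (
  SELECT
    SUBSTRING(REGEXP_REPLACE(raw_phone, '[^0-9]', '', 'g')
              FROM '[0-9]{8}$') AS phone_core
  FROM customers_raw_text
) t
GROUP BY LENGTH(phone_core);
```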

Warning | Hidden Assumptions

Every SUBSTRING() encodes assumptions about:

  • string length
  • format stability
  • semantic meaning of positions

If formats change upstream, SUBSTRING() can silently break logic.

Always document:

  • why a position was chosen
  • what happens if length changes

SUBSTRING() is a powerful extraction tool, but only when its assumptions are explicit and validated.

CONCAT() and ||

After extracting and cleaning text (TRIM(), LOWER(), INITCAP(), SUBSTRING()), a very common analytical task is to construct new text values from multiple columns.

This is where CONCAT() and the string concatenation operator || are used.

They allow you to build full identifiers, labels, and display fields from atomic components.


Why CONCAT() Matters in Analytics

In real analytical datasets, information is often split across columns:

  • first name + last name
  • city + zip code
  • category + subcategory
  • year + quarter

From an analytical perspective, this means:

  • reporting often requires combined fields
  • joins may rely on constructed keys
  • dashboards need human-readable labels

Concatenation is therefore a feature engineering operation, not just formatting.

CONCAT() | Syntax Overview

CONCAT(value1, value2, ..., valueN)

Key properties:

  • accepts multiple arguments
  • treats NULL as an empty string
  • always returns text

PostgreSQL also supports the operator form:

value1 || value2

Example 1 | Building Full Names (Raw)

Assume we have separate first_name and last_name columns.

SELECT
  customer_id,
  first_name,
  last_name,
  CONCAT(first_name, ' ', last_name) AS full_name
FROM customers_raw_text;

Output

| customer_id | first_name | last_name | full_name |
|---|---|---|---|
| 1 | joHN | doE | joHN doE |
| 2 | MARY | sMiTh | MARY sMiTh |
| 3 | aLEx | johnSON | aLEx johnSON |
| 4 | anna | VAN DYKE | anna VAN DYKE |
| 5 | NULL | brOwn | ' brOwn' |

Note row 5: CONCAT() treats the NULL first name as an empty string, so the result still carries the leading space from the ' ' separator.

Example 2 | CONCAT() + INITCAP() (Recommended Pattern)

A very common and safe analytical pattern is:

  1. normalize case
  2. then concatenate
SELECT
  customer_id,
  INITCAP(first_name) AS first_name_clean,
  INITCAP(last_name)  AS last_name_clean,
  INITCAP(CONCAT(
    first_name,
    ' ',
    last_name
  )) AS full_name_clean
FROM customers_raw_text;

Output

| customer_id | first_name_clean | last_name_clean | full_name_clean |
|---|---|---|---|
| 1 | John | Doe | John Doe |
| 2 | Mary | Smith | Mary Smith |
| 3 | Alex | Johnson | Alex Johnson |
| 4 | Anna | Van Dyke | Anna Van Dyke |
| 5 | NULL | Brown | ' Brown' |

CONCAT() vs ||

The same result can be achieved using the operator form.

SELECT
  customer_id,
  INITCAP(first_name) || ' ' || INITCAP(last_name) AS full_name_clean
FROM customers_raw_text;

Key Difference

  • CONCAT() treats NULL as empty
  • || returns NULL if any operand is NULL

This difference is analytically important.

  • use CONCAT() when NULL values are possible
  • use || when you want NULL propagation
  • always normalize text before concatenation
  • validate results with COUNT(DISTINCT ...)
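The NULL behavior can be demonstrated side by side. A sketch, where COALESCE() with a '?' placeholder is one illustrative way to control the fallback explicitly:

```sql
-- Row 5 has a NULL first_name: CONCAT() yields ' brOwn',
-- while || propagates NULL through the whole expression.
SELECT
  customer_id,
  CONCAT(first_name, ' ', last_name)            AS concat_version,
  first_name || ' ' || last_name                AS operator_version,
  COALESCE(first_name, '?') || ' ' || last_name AS explicit_fallback
FROM customers_raw_text
WHERE first_name IS NULL;
```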
Warning | Constructed Fields

Concatenated fields:

  • often look harmless
  • but frequently become join keys or grouping dimensions

If constructed inconsistently, they lead to:

  • duplicate entities
  • broken joins
  • misleading KPIs

Always document:

  • construction logic
  • ordering
  • casing rules

CONCAT() is a simple function, but it plays a critical role in building reliable analytical dimensions.

POSITION() / STRPOS()

The next analytical task is often to locate where something appears inside a string.

This is exactly what POSITION() and its PostgreSQL-specific alias STRPOS() do.

They answer a simple but powerful question:

Where does a given substring start?


Why POSITION() / STRPOS() Matters in Analytics

In analytical workflows, knowing the position of a character or substring allows you to:

  • validate expected formats
  • detect malformed values
  • drive conditional logic
  • prepare for controlled extraction

Typical analytical scenarios include:

  • checking if a delimiter exists
  • identifying country codes in phone numbers
  • detecting annotations in categories
  • validating structured identifiers

POSITION() | Syntax Overview

POSITION(substring IN text)
  • returns the 1-based position of substring
  • returns 0 if the substring is not found

PostgreSQL also supports:

STRPOS(text, substring)

Both behave identically; only the argument order differs.
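
For example, both of these expressions return 4, the 1-based index of the first hyphen:

SELECT
  POSITION('-' IN '077-600-045') AS position_result,
  STRPOS('077-600-045', '-')     AS strpos_result;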


Example 1 | Detect Parentheses in Categories

Before removing annotations, we may want to detect which rows contain them.

SELECT
  category_raw,
  POSITION('(' IN category_raw) AS open_paren_pos
FROM customers_raw_text;

Output

category_raw        | open_paren_pos
Accessories (Promo) | 13
Electronics (Old)   | 13
Accessories         | 0
Electronics (Promo) | 13
Accessories (Test)  | 13

Analytical Interpretation

  • open_paren_pos > 0 → annotation exists
  • open_paren_pos = 0 → clean category

This allows you to:

  • flag rows for cleaning
  • apply conditional logic
  • measure data quality issues

Example 2 | STRPOS() for Phone Number Diagnostics

We can also use STRPOS() to detect formatting characters.

SELECT
  raw_phone,
  STRPOS(raw_phone, '-') AS hyphen_pos,
  STRPOS(raw_phone, '(') AS paren_pos
FROM customers_raw_text;

Output

raw_phone        | hyphen_pos | paren_pos
077600945        | 0          | 0
077-600-045      | 4          | 0
(374)-77-600-945 | 6          | 1
37477600945      | 0          | 0
77600945         | 0          | 0

Analytical Interpretation

This immediately tells us:

  • which rows contain separators
  • which rows include country codes
  • which rows are already clean
Note

This is often done before applying REGEXP_REPLACE().
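
This diagnose-then-clean flow can be sketched as a conditional projection: only rows where an annotation was detected are rewritten. The regex here is an assumption about the annotation format (a single parenthesized suffix).

SELECT
  category_raw,
  CASE
    WHEN POSITION('(' IN category_raw) > 0
      THEN TRIM(REGEXP_REPLACE(category_raw, '\(.*\)', '', 'g'))
    ELSE category_raw
  END AS category_clean
FROM customers_raw_text;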


POSITION() as a Validation Tool

A common analytical pattern is to convert position checks into flags.

SELECT
  customer_id,
  category_raw,
  POSITION('(' IN category_raw) > 0 AS has_annotation
FROM customers_raw_text;

Conceptual Output

customer_id | category_raw        | has_annotation
1           | Accessories (Promo) | true
2           | Electronics (Old)   | true
3           | Accessories         | false
4           | Electronics (Promo) | true
5           | Accessories (Test)  | true

This is useful for:

  • data quality reporting
  • audit checks
  • conditional cleaning logic
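
For instance, a data quality report can aggregate the flag into a single summary row (FILTER is PostgreSQL syntax):

SELECT
  COUNT(*) AS total_rows,
  COUNT(*) FILTER (WHERE POSITION('(' IN category_raw) > 0) AS annotated_rows
FROM customers_raw_text;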

POSITION() vs SUBSTRING()

Question        | Prefer
Where is it?    | POSITION() / STRPOS()
Extract it      | SUBSTRING()
Validate format | POSITION()
Clean text      | REGEXP_REPLACE()

Analytical Best Practice

  • use POSITION() to measure and detect
  • use SUBSTRING() to extract
  • use REGEXP_REPLACE() to clean
  • never assume structure without checking positions

Analytical Warning | Silent Assumptions

Assuming a substring exists without checking its position leads to:

  • incorrect extractions
  • unexpected NULLs
  • silent logic failures

Always validate structure before relying on it.

POSITION() and STRPOS() are small functions with high analytical value when used for validation and diagnostics.

SPLIT_PART()

After locating structure with POSITION() and before applying complex regex logic,
a very common analytical task is to split a string by a known delimiter.

SPLIT_PART() extracts a specific segment from a delimited text field.
It is simpler and safer than regex when the structure is known and consistent.


Why SPLIT_PART() Matters in Analytics

In real datasets, multiple values are often stored in a single column, separated by delimiters.

  • phone numbers (077-600-045)
  • composite IDs (ORD-2024-001)
  • hierarchical categories (Electronics-Mobile-Android)
  • version strings (v1.2.5)

From an analytical perspective:

  • grouping often requires one specific segment
  • joins depend on consistent extracted parts
  • validation depends on expected segment positions

SPLIT_PART() | Syntax

SPLIT_PART(text, delimiter, position)
  • text → input string
  • delimiter → character or string used to split
  • position → 1-based index of the part to return

If the requested position does not exist, SPLIT_PART() returns an empty string.


Example 1 | Split Phone Numbers by Hyphen (First Part)

Input

raw_phone
077-600-045
(374)-77-600-945
77600945


SELECT
  raw_phone,
  SPLIT_PART(raw_phone, '-', 1) AS first_part
FROM customers_raw_text;


Output

raw_phone        | first_part
077-600-045      | 077
(374)-77-600-945 | (374)
77600945         | 77600945

Example 2 | Split Phone Numbers by Hyphen (Second Part)

We can also extract the second segment.


SELECT
  raw_phone,
  SPLIT_PART(raw_phone, '-', 2) AS second_part
FROM customers_raw_text;


Output

raw_phone        | second_part
077-600-045      | 600
(374)-77-600-945 | 77
77600945         | (empty string)

Analytical Interpretation

  • empty result means the delimiter does not exist
  • empty string is not NULL
  • malformed or inconsistent rows are exposed immediately

This step is often used as a diagnostic, not just extraction.


Important Behavior to Remember

  • missing segments return an empty string
  • empty strings can silently affect:
    • grouping
    • counting
    • joins

Because of this, SPLIT_PART() is often paired with NULLIF().
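
A minimal sketch of that pairing: the empty string that SPLIT_PART() returns for delimiter-less rows is converted into a genuine NULL.

SELECT
  raw_phone,
  NULLIF(SPLIT_PART(raw_phone, '-', 2), '') AS second_part_or_null
FROM customers_raw_text;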


Example 3 | SPLIT_PART() After Cleaning

In practice, SPLIT_PART() is rarely used on raw data.


SELECT
  raw_phone,
  SPLIT_PART(
    REGEXP_REPLACE(raw_phone, '[^0-9-]', '', 'g'),
    '-',
    2
  ) AS clean_second_part
FROM customers_raw_text;

This combines:

  • pattern-based cleaning
  • delimiter-based extraction

SPLIT_PART() vs Other String Functions

Question                  | Prefer
Known delimiter           | SPLIT_PART()
Fixed character positions | SUBSTRING()
Variable formats          | REGEXP_REPLACE()
Structure validation      | POSITION()

Analytical Best Practice

  • validate delimiter presence with POSITION()
  • clean text before splitting
  • handle empty strings explicitly
  • never assume all rows share the same structure

Analytical Warning | Empty Strings

SPLIT_PART() returns empty strings, not NULL.

This can:

  • inflate COUNT(DISTINCT ...)
  • break joins
  • hide malformed records

Always validate results using:

  • LENGTH()
  • NULLIF()
  • COUNT(DISTINCT ...)
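
A validation query along these lines (column names as in the examples above; FILTER is PostgreSQL syntax) exposes how many rows lack the expected delimiter:

SELECT
  COUNT(*) AS total_rows,
  COUNT(*) FILTER (WHERE SPLIT_PART(raw_phone, '-', 2) = '') AS missing_second_part,
  COUNT(DISTINCT NULLIF(SPLIT_PART(raw_phone, '-', 2), '')) AS distinct_valid_parts
FROM customers_raw_text;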

NULLIF()

After handling extraction, cleaning, and construction of text values, the next important analytical task is preventing misleading values from entering metrics and logic.

NULLIF() is used to convert specific values into NULL intentionally.

It answers the question:

When should a value be treated as “missing”?


Why NULLIF() Matters in Analytics

In real datasets, missing information is often encoded as fake values, such as:

  • empty strings ''
  • placeholders like 'N/A', 'UNKNOWN'
  • default values like '0'

From an analytical perspective, these values:

  • distort counts
  • break averages
  • pollute grouping keys
  • hide true data quality issues

NULLIF() allows you to restore semantic meaning by converting such placeholders into NULL.


NULLIF() | Syntax Overview

NULLIF(value, comparison_value)

How it works:

  • if value = comparison_value → returns NULL
  • otherwise → returns value

Example 1 | Empty Strings to NULL

Assume some customers have missing last names stored as empty strings.

SELECT
  customer_id,
  last_name,
  NULLIF(last_name, '') AS last_name_clean
FROM customers_raw_text;

Conceptual Output

customer_id | last_name | last_name_clean
1           | doe       | doe
2           | smith     | smith
3           | johnson   | johnson
4           | VAN DYKE  | VAN DYKE
5           | (empty)   | NULL

Analytical Interpretation

  • empty strings are not meaningful values
  • converting them to NULL allows:
    • correct COUNT(last_name)
    • correct COUNT(DISTINCT last_name)
    • accurate completeness checks
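
The effect on counts can be verified directly: COUNT() skips NULL, so the two results below differ by exactly the number of empty-string last names.

SELECT
  COUNT(last_name)             AS count_including_empties,
  COUNT(NULLIF(last_name, '')) AS count_true_values
FROM customers_raw_text;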

Example 2 | Placeholder Text to NULL

Sometimes missing values are encoded explicitly.

SELECT
  category_raw,
  NULLIF(category_raw, 'UNKNOWN') AS category_clean
FROM customers_raw_text;

Example 3 | NULLIF() Before Aggregation

NULLIF() is often used before aggregation to prevent bad values from entering metrics.

SELECT
  COUNT(NULLIF(category_raw, 'UNKNOWN')) AS valid_categories
FROM customers_raw_text;

This ensures that placeholder values do not inflate counts.


NULLIF() with CONCAT() (Defensive Pattern)

When building labels, NULLIF() can prevent ugly outputs.

SELECT
  customer_id,
  CONCAT(
    first_name,
    ' ',
    NULLIF(last_name, '')
  ) AS full_name_safe
FROM customers_raw_text;

Without NULLIF(), empty strings may silently produce malformed labels.


NULLIF() vs COALESCE()

Goal                      | Use
Convert bad value to NULL | NULLIF()
Replace NULL with value   | COALESCE()
Restore missing semantics | NULLIF()
Impute missing values     | COALESCE()

These functions are often used together, but serve opposite purposes.
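
A common combined pattern, sketched here with a hypothetical 'Unknown' reporting label: NULLIF() first restores missing semantics, then COALESCE() imputes a display value.

SELECT
  customer_id,
  COALESCE(NULLIF(last_name, ''), 'Unknown') AS last_name_reported
FROM customers_raw_text;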


Analytical Best Practice

Before aggregation or joins:

  • identify placeholder values
  • convert them to NULL using NULLIF()
  • then apply aggregation or imputation logic

This keeps business meaning intact.


Analytical Warning | Silent Placeholders

Placeholder values are more dangerous than NULLs.

  • they look valid
  • they pass filters
  • they corrupt metrics silently

Always audit text fields for placeholder values
before trusting analytical results.

NULLIF() is a small function with high analytical impact, especially in data quality and KPI correctness.

LEFT() / RIGHT()

After working with positional extraction (SUBSTRING()) and pattern-based logic, we often encounter text fields with a fixed directional structure.

In such cases, extracting characters from the beginning or the end of a string is clearer and safer than calculating positions.

This is exactly what LEFT() and RIGHT() are designed for.


Why LEFT() / RIGHT() Matter in Analytics

Many analytical identifiers encode meaning at the edges of a string:

  • country or operator prefixes
  • regional or product codes
  • fixed-length identifiers
  • version or batch suffixes

When structure is stable, directional extraction is:

  • more readable
  • less error-prone
  • easier to audit

Syntax Overview

LEFT(text, n)
RIGHT(text, n)
  • text → input string
  • n → number of characters to extract
  • extraction starts strictly from the left or right

Example 1 | Extract Phone Prefix with LEFT()

Assume phone numbers have already been cleaned to digits only
and the first 3 digits represent a country or operator code.

SELECT
  raw_phone,
  LEFT(
    REGEXP_REPLACE(raw_phone, '[^0-9]', '', 'g'),
    3
  ) AS phone_prefix
FROM customers_raw_text;

Output

raw_phone        | phone_prefix
077600945        | 077
077-600-045      | 077
(374)-77-600-945 | 374
37477600945      | 374
77600945         | 776

Analytical Interpretation

  • prefixes reveal country or operator
  • enables routing or segmentation logic
  • supports validation of expected formats
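
For example, segmentation then becomes a simple GROUP BY on the extracted prefix:

SELECT
  LEFT(REGEXP_REPLACE(raw_phone, '[^0-9]', '', 'g'), 3) AS phone_prefix,
  COUNT(*) AS phone_count
FROM customers_raw_text
GROUP BY 1
ORDER BY phone_count DESC;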

Example 2 | Extract Core Identifier with RIGHT()

If the last 8 digits uniquely identify a phone number,
RIGHT() provides a clean and expressive solution.

SELECT
  raw_phone,
  RIGHT(
    REGEXP_REPLACE(raw_phone, '[^0-9]', '', 'g'),
    8
  ) AS phone_core
FROM customers_raw_text;

Output

raw_phone   | phone_core
077600945   | 77600945
37477600945 | 77600945
77600945    | 77600945

LEFT() / RIGHT() vs SUBSTRING()

Scenario               | Prefer
Fixed prefix or suffix | LEFT() / RIGHT()
Dynamic position       | SUBSTRING()
Readability            | LEFT() / RIGHT()
Complex extraction     | SUBSTRING()

Analytical Best Practice

  • apply cleaning before directional extraction
  • validate assumptions with LENGTH()
  • use LEFT()/RIGHT() only when format is guaranteed
  • document why a specific length was chosen
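
A length audit like the following, run before choosing n, shows whether the fixed-length assumption actually holds:

SELECT
  LENGTH(REGEXP_REPLACE(raw_phone, '[^0-9]', '', 'g')) AS digit_count,
  COUNT(*) AS rows_with_this_length
FROM customers_raw_text
GROUP BY 1
ORDER BY 1;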

Analytical Warning | Assumed Structure

Directional extraction assumes:

  • stable formatting
  • consistent string length
  • no missing segments

If upstream formats change, LEFT() and RIGHT() can silently produce incorrect results.

Always validate structure before relying on directional logic.

LEFT() and RIGHT() are simple functions, but when used correctly, they provide clear, auditable, and efficient extraction for fixed-format analytical fields.

Dummy Dataset | Messy Transaction Text Data

This dataset simulates realistic dirty text data commonly seen in transactional systems.
It will be used for both in-class exercises and the take-home case study.

The table intentionally contains:

  • inconsistent phone formats
  • annotated categories
  • mixed casing
  • formatting symbols
  • duplicate logical entities

Step 1 | Create Dummy Transaction Table

DROP TABLE IF EXISTS transactions_text_demo;

CREATE TABLE transactions_text_demo (
  transaction_id INTEGER,
  customer_id    INTEGER,
  raw_phone      TEXT,
  category_raw   TEXT,
  quantity       INTEGER,
  price          NUMERIC(10,2)
);

Step 2 | Populate with 1,000 Rows of Messy Data

INSERT INTO transactions_text_demo
SELECT
  gs AS transaction_id,
  (RANDOM() * 200)::INT + 1 AS customer_id,

  CASE (gs % 6)
    WHEN 0 THEN '   077600945  '
    WHEN 1 THEN '077-600-045'
    WHEN 2 THEN '(374)-77-600-945'
    WHEN 3 THEN '37477600945'
    WHEN 4 THEN '77600945'
    ELSE '077600945'
  END AS raw_phone,

  CASE (gs % 5)
    WHEN 0 THEN 'Accessories (Promo)'
    WHEN 1 THEN 'Accessories (Test)'
    WHEN 2 THEN 'Electronics (Old)'
    WHEN 3 THEN 'Electronics (Promo)'
    ELSE 'Accessories'
  END AS category_raw,

  (RANDOM() * 5)::INT + 1 AS quantity,
  (RANDOM() * 500 + 10)::NUMERIC(10,2) AS price
FROM generate_series(1, 1000) AS gs;

Step 3 | Sanity Check

SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT raw_phone) AS distinct_raw_phones,
  COUNT(DISTINCT category_raw) AS distinct_categories
FROM transactions_text_demo;

In-Class Assignment

Objective

Use measurement-first SQL to understand how dirty text affects grouping and counts.


Task 1 | Phone Number Diagnostics

Write a query that shows:

  • raw_phone
  • LENGTH(raw_phone)
  • position of '-'
  • position of '('
  • count of rows per pattern

Use:

  • LENGTH()
  • POSITION() or STRPOS()
  • GROUP BY

Task 2 | Category Fragmentation

Write a query that returns:

  • category_raw
  • number of transactions per category

Use:

  • GROUP BY category_raw
  • ORDER BY COUNT(*) DESC

Answer:

  • how many logical categories exist?
  • how many categories appear due to annotations?

Expected Insight

You should clearly see that:

  • grouping on raw text produces fragmented dimensions
  • phone numbers represent the same entity in multiple formats

Homework

Scenario

You are asked to produce reliable KPIs from transactions_text_demo.

Business questions:

  • revenue by category
  • number of unique customers
  • average transaction value

Part 1 | Profiling

Write SQL queries to assess:

  • phone number format diversity
  • category fragmentation
  • impact of dirty text on GROUP BY

Use only diagnostic functions.


Part 2 | Standardization Layer

Create a cleaned SELECT projection (no updates) that includes:

  • standardized phone number (last 8 digits)
  • cleaned category (no annotations, trimmed)
  • revenue per transaction

Use:

  • REGEXP_REPLACE()
  • TRIM()
  • SUBSTRING()
  • CONCAT() where relevant

Part 3 | KPI Comparison

Compute and compare:

  • revenue by raw category
  • revenue by cleaned category
  • unique customers (raw vs cleaned phone)

Use GROUP BY in all comparisons.


Part 4 | Analytical Explanation

Briefly explain:

  • why KPIs changed
  • which cleaning step had the biggest impact
  • what assumptions you made
  • what could silently break in production

Submission Rules

  • Create string_functions.sql
  • SQL only (no updates or deletes)
  • all cleaning must be done in SELECT queries
  • results must be reproducible
  • explanations must be concise and analytical
  • push it to GitHub

Important

This assignment tests analytical thinking, not just SQL syntax.

Dirty text → broken GROUP BY → misleading KPIs.