

Session 06: DA with SQL | String Functions


Text Functions | General Overview

Text (string) functions are used to inspect, clean, standardize, and transform textual data at the row level.

Unlike numeric aggregate functions (SUM, AVG, COUNT), most text functions:

  • operate per row
  • return derived text columns
  • do not reduce the number of rows on their own

In analytical workflows, text functions are typically applied before aggregation to ensure that grouping, counting, and summarization behave correctly.

Running Docker

Let's start the database container with Docker Compose in detached mode:

docker compose up -d

Why Text Functions Matter in Analytics

Real-world text data is often dirty, inconsistent, and ambiguous.

Common issues include:

  • inconsistent casing (Electronics vs electronics)
  • leading or trailing spaces
  • annotations inside text (Category (Promo))
  • mixed formats (ID-123, ID_123, 123)
  • missing or partially filled values

If these issues are not handled explicitly:

  • GROUP BY produces fragmented categories
  • COUNT(DISTINCT ...) overcounts entities
  • KPIs become inconsistent across dashboards
  • joins silently fail

Text functions are therefore data quality tools, not cosmetic helpers.


Typical Analytical Use Cases

Text functions are commonly used for:

  • data cleaning
    • trimming spaces
    • normalizing case
    • removing annotations
  • standardization
    • making categories comparable
    • ensuring consistent grouping keys
  • feature engineering
    • extracting parts of strings
    • building derived text features
  • validation
    • detecting malformed values
    • identifying unexpected formats

Analytical Principle

Text transformations change how rows are grouped.

Because of this:

  • text cleaning is an analytical decision
  • not a purely technical step
  • and must be done intentionally and consistently

Dummy Setup | Raw Phone Numbers and Names

To demonstrate text normalization beyond phone numbers, let's create a dummy table that covers the entire spectrum of issues.

This allows us to later apply functions such as UPPER(), LOWER(), INITCAP(), and validation logic on names.

DROP TABLE IF EXISTS customers_raw_text;

CREATE TABLE customers_raw_text (
  customer_id   INTEGER,
  first_name    TEXT,
  last_name     TEXT,
  raw_phone     TEXT,
  category_raw  TEXT,
  birth_date    DATE
);

INSERT INTO customers_raw_text (
  customer_id,
  first_name,
  last_name,
  raw_phone,
  category_raw,
  birth_date
) VALUES
  (1, 'joHN',     'doE',        '   077600945  ',   'Accessories (Promo)', DATE '1994-03-12'),
  (2, 'MARY',     'sMiTh',      '077-600-045',      'Electronics (Old)',   DATE '1988-11-05'),
  (3, 'aLEx',     'johnSON',    '(374)-77-600-945', 'Accessories',         DATE '2001-07-23'),
  (4, 'anna',     'VAN DYKE',   '37477600945',      'Electronics (Promo)', DATE '1999-01-30'),
  (5, NULL,       'brOwn',      '77600945',         'Accessories (Test)',  DATE '1994-03-12');

What this setup gives us

  • phone numbers with inconsistent formatting
  • names with:
    • random capitalization
    • mixed casing within the same value
    • multi-part last names
  • unstructured category labels containing annotations

This mirrors real-world dirty text data and prepares us for the next set of string functions focused on standardization and normalization.
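As an optional sanity check (a hypothetical profiling step, not part of the original setup), we can profile the raw table before any cleaning:

```sql
-- Quick profile of the raw table: row count, distinct raw values,
-- and missing names. Runs against customers_raw_text created above.
SELECT
  COUNT(*)                                   AS total_rows,
  COUNT(DISTINCT raw_phone)                  AS distinct_raw_phones,
  COUNT(DISTINCT category_raw)               AS distinct_raw_categories,
  COUNT(*) FILTER (WHERE first_name IS NULL) AS missing_first_names
FROM customers_raw_text;
```

Distinct counts computed over raw text overstate the number of real entities, which is exactly the problem the following functions address.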

LENGTH()

Text fields such as phone numbers often arrive in many inconsistent formats.
Before attempting to clean or standardize them, a good analyst first tries to detect which rows are malformed.

The LENGTH() function is a simple but powerful diagnostic tool for this purpose.

Inspecting the Lengths

The first question to ask is:

Do all values have the same length?

SELECT
  raw_phone,
  LENGTH(raw_phone) AS phone_length
FROM customers_raw_text;

Output

| raw_phone | phone_length |
|---|---|
| '   077600945  ' | 14 |
| '077-600-045' | 11 |
| '(374)-77-600-945' | 16 |
| '37477600945' | 11 |
| '77600945' | 8 |

Interpretation

Even before cleaning, we can already classify rows:

  • 8 characters: likely already standardized → 77600945
  • 9 characters (e.g. after trimming): a correct number with a leading zero → 077600945
  • 11 characters: likely contains a country code or separators
  • 14+ characters: clearly contains whitespace and formatting symbols

At this point we can already suspect several distinct sources of noise:

  • spaces
  • hyphens
  • parentheses
  • country prefixes

Why This Step Is Critical

By using LENGTH() before cleaning, you can:

  • detect formatting issues early
  • identify multiple data quality patterns
  • avoid applying a one-size-fits-all transformation
  • design targeted cleaning rules

This avoids silent errors later when:

  • grouping
  • joining
  • deduplicating

Analytical Insight

Notice that:

  • LENGTH() alone already tells us which rows cannot possibly match the target
  • the target 77600945 has a clear expected length

This gives us a validation rule we can enforce after cleaning.
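One way to turn that expectation into an explicit rule is a length-based flag. A minimal sketch, assuming the 8-digit target length from this example:

```sql
-- Flag rows whose trimmed length deviates from the expected 8 digits.
-- The expected length of 8 is an assumption specific to this example.
SELECT
  raw_phone,
  LENGTH(TRIM(raw_phone)) AS trimmed_length,
  CASE
    WHEN LENGTH(TRIM(raw_phone)) = 8 THEN 'likely clean'
    ELSE 'needs cleaning'
  END AS length_check
FROM customers_raw_text;
```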

TRIM()

Text fields often contain invisible formatting noise, such as leading or trailing spaces.
Although these spaces are not obvious when visually inspecting data, they can:

  • break joins
  • fragment groups
  • cause false duplicates
  • invalidate validation rules

The TRIM() function is used to remove unnecessary whitespace from the beginning and end of text values.

Importantly, TRIM() removes noise, not meaning.


Why TRIM() Matters in Analytics

Whitespace has no business semantics.

That means:

  • removing it does not change the underlying value
  • but failing to remove it can distort analytical results

Because of this, TRIM() is considered a safe and low-risk transformation and is usually applied early in data-cleaning pipelines.


Applying TRIM()

Below we apply TRIM() to raw phone numbers and compare lengths before and after trimming.

SELECT
  raw_phone,
  LENGTH(raw_phone) AS length,
  TRIM(raw_phone) AS trimmed_phone,
  LENGTH(TRIM(raw_phone)) AS trimmed_length
FROM customers_raw_text;

Result Interpretation

From the output, we observe that:

  • leading and trailing spaces are removed
  • internal characters (digits, hyphens, parentheses) remain unchanged
  • some rows move closer to the expected target length
  • others still violate formatting rules

This confirms that:

  • whitespace was only one of several issues
  • further transformations are required for full standardization

Directional Variants

SQL also provides directional trimming functions:

LTRIM(raw_phone)   -- removes leading spaces only
RTRIM(raw_phone)   -- removes trailing spaces only


Using TRIM() is equivalent to applying both.


Analytical Best Practice

Always follow this pattern:

  1. Measure (e.g., LENGTH())
  2. Transform (e.g., TRIM())
  3. Re-measure to validate impact

This makes transformations:

  • transparent
  • auditable
  • analytically defensible
Important
  • TRIM() removes invisible formatting noise
  • it does not alter business meaning
  • it prepares text data for structural cleaning
  • it should almost always be applied before more aggressive text transformations

Whitespace issues are subtle, but ignoring them leads to silent analytical errors.

LOWER(), UPPER(), INITCAP()

Text values often differ only by capitalization, even though they represent the same entity.

For example:

  • john, John, JOHN
  • van dyke, Van Dyke, VAN DYKE

From an analytical perspective, these differences are purely cosmetic, but if left untreated they lead to:

  • fragmented groups
  • inflated COUNT(DISTINCT ...)
  • failed joins
  • inconsistent reporting

SQL provides several functions to normalize text case.


Why Case Normalization Matters

Capitalization has no business meaning, but SQL treats differently cased strings as different values.

That means:

  • 'john' ≠ 'John' ≠ 'JOHN'
  • grouping becomes unreliable
  • deduplication becomes inaccurate

Case normalization is therefore a data quality operation, not a formatting preference.


LOWER() | Normalize to Lowercase

LOWER() converts all characters in a string to lowercase.

This is the most common choice for:

  • grouping keys
  • joins
  • deduplication
  • categorical dimensions
SELECT
  first_name,
  LOWER(first_name) AS first_name_lower
FROM customers_raw_text;

When to Use

  • internal identifiers
  • grouping and joins
  • text comparison logic

UPPER() | Normalize to Uppercase

UPPER() converts all characters in a string to uppercase.

This is often used for:

  • codes
  • abbreviations
  • country or region identifiers
SELECT
  last_name,
  UPPER(last_name) AS last_name_upper
FROM customers_raw_text;

When to Use

  • standardized codes
  • reporting conventions
  • legacy system alignment

INITCAP() | Proper Name Formatting

INITCAP() converts text to title case:

  • first letter uppercase
  • remaining letters lowercase
SELECT
  first_name,
  INITCAP(first_name) AS first_name_clean
FROM customers_raw_text;

Example Transformations

| raw value | INITCAP result |
|---|---|
| john | John |
| mARy | Mary |
| van dyke | Van Dyke |

Analytical Comparison

| Function | Best Use Case |
|---|---|
| LOWER() | joins, grouping, deduplication |
| UPPER() | codes, abbreviations |
| INITCAP() | presentation, names |

Analytical Best Practice

A common and robust pattern is:

  • use LOWER() for keys and joins
  • use INITCAP() for display purposes
  • never mix raw and normalized text in analysis

Case normalization:

  • changes how rows group
  • affects counts and uniqueness
  • must be applied consistently

Like TRIM(), these functions remove noise, not meaning.
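To see the grouping impact directly, one possible diagnostic (a sketch, not from the session materials) compares distinct counts before and after normalization:

```sql
-- Distinct last names before vs after case normalization.
-- If the two counts differ, raw casing is fragmenting the dimension.
SELECT
  COUNT(DISTINCT last_name)        AS distinct_raw,
  COUNT(DISTINCT LOWER(last_name)) AS distinct_normalized
FROM customers_raw_text;
```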

REPLACE()

After handling whitespace (TRIM()) and capitalization (LOWER(), UPPER(), INITCAP()), the next common problem in real-world text data is structural noise inside strings.

Structural noise refers to characters that are part of visual formatting but carry no analytical meaning.

Typical examples include:

  • hyphens in phone numbers
  • spaces used as separators
  • parentheses around country codes
  • dots, slashes, or underscores

If these characters are not removed, text values that represent the same entity remain incomparable.


Why REPLACE() Matters in Analytics

REPLACE() allows you to explicitly remove or substitute known characters.

From an analytical perspective:

  • the rule is deterministic
  • the transformation is transparent
  • the outcome is fully explainable

This makes REPLACE() suitable for early-stage structural cleanup, before applying more advanced pattern-based logic.


Basic Usage

The REPLACE() function has the following structure:

REPLACE(text, old_value, new_value)

It replaces all occurrences of old_value with new_value.


Removing Known Separators

Phone numbers often include hyphens for readability.

SELECT
  raw_phone,
  REPLACE(raw_phone, '-', '') AS phone_no_hyphen
FROM customers_raw_text;

| raw_phone | phone_no_hyphen |
|---|---|
| 077-600-045 | 077600045 |
| (374)-77-600-945 | (374)77600945 |

Only the hyphen is removed.
All other characters remain untouched.


Chaining REPLACE() Calls

When multiple unwanted characters are present, REPLACE() calls can be nested.

SELECT
  raw_phone,
  REPLACE(
    REPLACE(
      REPLACE(TRIM(raw_phone), '-', ''),
    '(', ''),
  ')', '') AS phone_partial_clean
FROM customers_raw_text;

This removes:

  • leading and trailing spaces
  • hyphens
  • parentheses

Resulting Pattern

| raw_phone | phone_partial_clean |
|---|---|
| (374)-77-600-945 | 37477600945 |
| 077-600-045 | 077600045 |

At this stage, formatting noise is reduced, but the value is still not standardized.


Analytical Limitations of REPLACE()

While REPLACE() is useful, it has important limitations:

  • it works only on explicit characters
  • it does not understand patterns
  • it cannot validate correctness
  • rules must be hard-coded

For example:

  • it cannot remove “everything that is not a digit”
  • it cannot enforce length rules
  • it cannot adapt to new formats

Because of this, REPLACE() is rarely the final cleaning step.


When to Use REPLACE()

Use REPLACE() when:

  • the unwanted character is known
  • formatting rules are simple
  • transformations must be transparent

Avoid relying on REPLACE() when:

  • formats vary widely
  • patterns are unknown
  • validation is required

Warning | Hard-Coded Assumptions

Every REPLACE() call encodes an assumption:

  • which characters are noise
  • which characters are meaningful
  • what happens if formats change

These assumptions must be:

  • documented
  • reviewed
  • justified by business logic

REGEXP_REPLACE()

After handling explicit characters with REPLACE(), we move to pattern-based cleaning.

Real-world text often contains unknown or variable noise.
In such cases, hard-coded replacements do not scale.

REGEXP_REPLACE() allows you to define rules, not characters.


Why REGEXP_REPLACE() Matters in Analytics

Regular expressions let you describe what to keep or what to remove, rather than listing every unwanted symbol.

From an analytical perspective, this means:

  • fewer hard-coded assumptions
  • better generalization to unseen formats
  • cleaner, more robust standardization rules

This is especially important for identifiers such as:

  • phone numbers
  • IDs
  • codes
  • reference keys

REGEXP_REPLACE() | Syntax Overview

REGEXP_REPLACE(text, pattern, replacement [, flags])

Key components:

  • text → the input string
  • pattern → a regular expression
  • replacement → what to substitute
  • flags → optional modifiers (e.g. global replacement)

Extracting Digits Only

A very common analytical task is to remove everything except digits.

The regex pattern:

[^0-9]

Pattern breakdown:

  • [] → character set
  • ^ → negation (NOT, when placed first inside the brackets)
  • 0-9 → the digit range

So the pattern matches any non-digit character.


Query | Digits-Only Cleaning

Let's run the code below:

SELECT
  raw_phone,
  REGEXP_REPLACE(raw_phone, '[^0-9]', '', 'g') AS digits_only
FROM customers_raw_text;

  • '[^0-9]': the regular expression pattern defining what to match
  • '': the replacement (an empty string, i.e. delete the match)
  • 'g': the global flag — without 'g', only the first match would be replaced
Note | Experiment

Feel free to experiment as much as you need!

Conceptual Output

The above query returns:

| raw_phone | digits_only |
|---|---|
| '   077600945  ' | 077600945 |
| '077-600-045' | 077600045 |
| '(374)-77-600-945' | 37477600945 |
| '37477600945' | 37477600945 |
| '77600945' | 77600945 |


  • all formatting symbols are removed
  • only numeric content remains
  • lengths still vary
Important | Primary Goal

The goal here is not to memorize regex syntax, but to recognize reusable analytical patterns.

Removing Annotations Inside Parentheses

Text fields often contain annotations or comments that should not be part of the analytical key

Recall our dummy table (name columns omitted):

| customer_id | raw_phone | category_raw | birth_date |
|---|---|---|---|
| 1 | '   077600945  ' | Accessories (Promo) | 1994-03-12 |
| 2 | 077-600-045 | Electronics (Old) | 1988-11-05 |
| 3 | (374)-77-600-945 | Accessories | 2001-07-23 |
| 4 | 37477600945 | Electronics (Promo) | 1999-01-30 |
| 5 | 77600945 | Accessories (Test) | 1994-03-12 |

Examples:

  • Accessories (Promo)
  • Electronics (Old)
  • Gold Customer (Test)

From an analytical standpoint, these annotations:

  • fragment categories
  • break grouping logic
  • inflate distinct counts

Regex Pattern 1 | Remove Everything Inside ()

The following pattern removes:

  • the opening parenthesis (
  • everything inside
  • the closing parenthesis )
\([^)]*\)

Pattern breakdown:

  • \( → literal opening parenthesis
  • [^)]* → any characters except ) (zero or more)
  • \) → literal closing parenthesis
Query | Cleaning Categories
SELECT
    category_raw,
    REGEXP_REPLACE(category_raw, '\([^)]*\)', '', 'g')
   AS category_clean1
FROM customers_raw_text;
Output | Cleaning Categories

| category_raw | category_clean1 |
|---|---|
| Accessories (Promo) | 'Accessories ' |
| Electronics (Old) | 'Electronics ' |
| Accessories | 'Accessories' |
| Electronics (Promo) | 'Electronics ' |
| Accessories (Test) | 'Accessories ' |

Note the trailing space left behind where each annotation was removed.
Caution | Whitespaces

Check the whitespaces!

Regex Pattern 2 | Remove Parentheses Content + TRIM Whitespaces

After removing annotations inside parentheses, trailing whitespaces often remain.
This leads to categories that look identical but fail equality checks.

To fix this, we chain TRIM() after REGEXP_REPLACE().

TRIM(REGEXP_REPLACE(category_raw, '\([^)]*\)', '', 'g'))

What this does:

  • removes everything inside ()
  • removes the parentheses themselves
  • trims leading and trailing spaces

Query | Cleaning Categories + TRIM
SELECT
    category_raw,
    TRIM(
      REGEXP_REPLACE(category_raw, '\([^)]*\)', '', 'g')
    ) AS category_clean2
FROM customers_raw_text;

Output | Cleaning Categories (After TRIM)
| category_raw | category_clean2 |
|---|---|
| Accessories (Promo) | Accessories |
| Electronics (Old) | Electronics |
| Accessories | Accessories |
| Electronics (Promo) | Electronics |
| Accessories (Test) | Accessories |

Note | Why TRIM Matters

Without TRIM(), values like 'Accessories ' and 'Accessories'
would be treated as different categories during grouping.


Regex Pattern 3 | Remove Parentheses + Leading Space in One Step

Instead of chaining functions, we can encode whitespace logic directly into the regex.

\s*\(.*?\)

Pattern breakdown:

  • \s* → zero or more whitespace characters
  • \( → opening parenthesis
  • .*? → any characters (non-greedy)
  • \) → closing parenthesis

This pattern removes:

  • the parentheses
  • the annotation
  • any space before the parenthesis

Query | Single-Step Category Cleaning
SELECT
    category_raw,
    REGEXP_REPLACE(category_raw, '\s*\(.*?\)', '', 'g')
      AS category_clean3
FROM customers_raw_text;

Output | Cleaning Categories (Single Regex)
| category_raw | category_clean3 |
|---|---|
| Accessories (Promo) | Accessories |
| Electronics (Old) | Electronics |
| Accessories | Accessories |
| Electronics (Promo) | Electronics |
| Accessories (Test) | Accessories |

Comparison Summary

| Approach | Result | Notes |
|---|---|---|
| Pattern 1 | 'Accessories ' | Leaves trailing spaces |
| Pattern 2 | 'Accessories' | Safe and explicit |
| Pattern 3 | 'Accessories' | Compact, regex-heavy |

Warning | Analytical Guidance

  • Start with Pattern 1 when learning
  • Prefer Pattern 2 when teaching or collaborating
  • Prefer Pattern 3 when rules are stable and well-documented
  • Always validate results with COUNT(DISTINCT ...) after cleaning

Regex-based cleaning directly impacts grouping, KPIs, and dashboards.
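A minimal validation query in that spirit (a sketch reusing the Pattern 2 cleaning rule from above):

```sql
-- Compare category cardinality before and after cleaning.
-- On the dummy table, five raw variants collapse to two clean categories.
SELECT
  COUNT(DISTINCT category_raw) AS distinct_raw_categories,
  COUNT(DISTINCT TRIM(REGEXP_REPLACE(category_raw, '\([^)]*\)', '', 'g')))
    AS distinct_clean_categories
FROM customers_raw_text;
```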

REGEXP vs REPLACE | When to Choose Which

| Scenario | Prefer |
|---|---|
| Known characters only | REPLACE() |
| Variable formatting | REGEXP_REPLACE() |
| Validation and extraction | REGEXP_REPLACE() |
| Transparency and simplicity | REPLACE() |

Analytical Warning

Warning | Regex Is Powerful but…

Regular expressions are powerful but dangerous if misused.

  • overly broad patterns can remove valid information
  • unreadable regex becomes technical debt
  • rules must be documented and tested

Always combine regex cleaning with measurement and validation.

SUBSTRING()

After cleaning whitespace, case, and structural noise, the next common analytical task is extracting meaningful parts of text.

SUBSTRING() allows you to select a specific portion of a string based on position or pattern.
This is essential when a single text field contains multiple pieces of information.

Unlike REPLACE() or REGEXP_REPLACE(), which remove noise,
SUBSTRING() is used to isolate signal.


Why SUBSTRING() Matters in Analytics

Many real-world text fields are compound fields, for example:

  • phone numbers with prefixes
  • codes with embedded meaning
  • identifiers with fixed segments
  • dates or versions encoded in strings

From an analytical perspective, this means:

  • grouping requires extracted keys
  • joins depend on consistent substrings
  • validation depends on expected positions

SUBSTRING() | Basic Positional Extraction

The most common form of SUBSTRING() uses starting position and length.

SUBSTRING(text FROM start_position FOR length)

Example 1 | Extract Last 8 Digits of Phone Numbers

Assume we have already removed non-digit characters and want to extract the core phone number.

Input (conceptual — digits-only values after cleaning)

| raw_phone |
|---|
| 077600945 |
| 37477600945 |
| 77600945 |

SELECT
  raw_phone,
  SUBSTRING(raw_phone FROM LENGTH(raw_phone) - 7 FOR 8) AS phone_core
FROM customers_raw_text;

Output

| raw_phone | phone_core |
|---|---|
| 077600945 | 77600945 |
| 37477600945 | 77600945 |
| 77600945 | 77600945 |

Analytical Interpretation

This logic:

  • assumes the last 8 digits uniquely identify the phone number
  • removes country codes and prefixes
  • creates a standardized join key

This pattern is extremely common in:

  • telecom analytics
  • customer deduplication
  • identity resolution

Example 2 | SUBSTRING() with Fixed Structure

When text follows a known fixed format, SUBSTRING() is deterministic and fast.

SELECT
  category_raw,
  SUBSTRING(category_raw FROM 1 FOR 11) AS category_prefix
FROM customers_raw_text;

Output

| category_raw | category_prefix |
|---|---|
| Accessories (Promo) | Accessories |
| Electronics (Old) | Electronics |
| Accessories | Accessories |
| Electronics (Promo) | Electronics |
| Accessories (Test) | Accessories |

This is useful when:

  • naming conventions are enforced
  • formats are guaranteed upstream

Example 3 | Extract Digits Using Regex

PostgreSQL also supports regex-based substring extraction.

SUBSTRING(text FROM pattern)

SELECT
  raw_phone,
  SUBSTRING(raw_phone FROM '[0-9]+') AS first_digit_sequence
FROM customers_raw_text;

This extracts the first continuous digit block.

| raw_phone | first_digit_sequence |
|---|---|
| '   077600945  ' | 077600945 |
| '077-600-045' | 077 |
| '(374)-77-600-945' | 374 |
| '37477600945' | 37477600945 |
| '77600945' | 77600945 |

Note that [0-9]+ is greedy: it matches the entire first contiguous run of digits, so an unbroken number is returned in full.

Positional vs Pattern-Based SUBSTRING()

| Use Case | Prefer |
|---|---|
| Fixed-length identifiers | Positional |
| Variable formats | Regex-based |
| Performance-critical logic | Positional |
| Unknown structure | Regex-based |

Analytical Best Practice

  • use positional SUBSTRING() when formats are stable
  • use regex SUBSTRING() when formats vary
  • always validate results with:
    • LENGTH()
    • COUNT(DISTINCT ...)
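Such a validation might look like the following sketch (it assumes the digits-only cleaning from the regex section; the subquery alias is illustrative):

```sql
-- Validate that every extracted phone core has the expected 8 digits.
-- Any group with a length other than 8 signals a broken assumption.
SELECT
  LENGTH(phone_core) AS core_length,
  COUNT(*)           AS rows_with_this_length
FROM (
  SELECT
    SUBSTRING(REGEXP_REPLACE(raw_phone, '[^0-9]', '', 'g')
              FROM '[0-9]{8}$') AS phone_core
  FROM customers_raw_text
) t
GROUP BY LENGTH(phone_core);
```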

Warning | Hidden Assumptions

Every SUBSTRING() encodes assumptions about:

  • string length
  • format stability
  • semantic meaning of positions

If formats change upstream, SUBSTRING() can silently break logic.

Always document:

  • why a position was chosen
  • what happens if length changes

SUBSTRING() is a powerful extraction tool, but only when its assumptions are explicit and validated.

CONCAT() and ||

After extracting and cleaning text (TRIM(), LOWER(), INITCAP(), SUBSTRING()), a very common analytical task is to construct new text values from multiple columns.

This is where CONCAT() and the string concatenation operator || are used.

They allow you to build full identifiers, labels, and display fields from atomic components.


Why CONCAT() Matters in Analytics

In real analytical datasets, information is often split across columns:

  • first name + last name
  • city + zip code
  • category + subcategory
  • year + quarter

From an analytical perspective, this means:

  • reporting often requires combined fields
  • joins may rely on constructed keys
  • dashboards need human-readable labels

Concatenation is therefore a feature engineering operation, not just formatting.

CONCAT() | Syntax Overview

CONCAT(value1, value2, ..., valueN)

Key properties:

  • accepts multiple arguments
  • treats NULL as an empty string
  • always returns text

PostgreSQL also supports the operator form:

value1 || value2

Example 1 | Building Full Names (Raw)

Assume we have separate first_name and last_name columns.

SELECT
  customer_id,
  first_name,
  last_name,
  CONCAT(first_name, ' ', last_name) AS full_name
FROM customers_raw_text;

Output

| customer_id | first_name | last_name | full_name |
|---|---|---|---|
| 1 | joHN | doE | joHN doE |
| 2 | MARY | sMiTh | MARY sMiTh |
| 3 | aLEx | johnSON | aLEx johnSON |
| 4 | anna | VAN DYKE | anna VAN DYKE |
| 5 | NULL | brOwn | ' brOwn' |

Note row 5: CONCAT() treats the NULL first name as an empty string, so the result still carries the leading space from the ' ' separator.

Example 2 | CONCAT() + INITCAP() (Recommended Pattern)

A very common and safe analytical pattern is:

  1. normalize case
  2. then concatenate
SELECT
  customer_id,
  INITCAP(first_name) AS first_name_clean,
  INITCAP(last_name)  AS last_name_clean,
  INITCAP(CONCAT(
    first_name,
    ' ',
    last_name
  )) AS full_name_clean
FROM customers_raw_text;

Output

| customer_id | first_name_clean | last_name_clean | full_name_clean |
|---|---|---|---|
| 1 | John | Doe | John Doe |
| 2 | Mary | Smith | Mary Smith |
| 3 | Alex | Johnson | Alex Johnson |
| 4 | Anna | Van Dyke | Anna Van Dyke |
| 5 | NULL | Brown | ' Brown' |

CONCAT() vs ||

The same result can be achieved using the operator form.

SELECT
  customer_id,
  INITCAP(first_name) || ' ' || INITCAP(last_name) AS full_name_clean
FROM customers_raw_text;

Key Difference

  • CONCAT() treats NULL as empty
  • || returns NULL if any operand is NULL

This difference is analytically important.

  • use CONCAT() when NULL values are possible
  • use || when you want NULL propagation
  • always normalize text before concatenation
  • validate results with COUNT(DISTINCT ...)
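The NULL behavior can be demonstrated side by side. A sketch, where COALESCE() with a '?' placeholder is one illustrative way to control the fallback explicitly:

```sql
-- Row 5 has a NULL first_name: CONCAT() yields ' brOwn',
-- while || propagates NULL through the whole expression.
SELECT
  customer_id,
  CONCAT(first_name, ' ', last_name)            AS concat_version,
  first_name || ' ' || last_name                AS operator_version,
  COALESCE(first_name, '?') || ' ' || last_name AS explicit_fallback
FROM customers_raw_text
WHERE first_name IS NULL;
```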
Warning | Constructed Fields

Concatenated fields:

  • often look harmless
  • but frequently become join keys or grouping dimensions

If constructed inconsistently, they lead to:

  • duplicate entities
  • broken joins
  • misleading KPIs

Always document:

  • construction logic
  • ordering
  • casing rules

CONCAT() is a simple function, but it plays a critical role in building reliable analytical dimensions.

POSITION() / STRPOS()

The next analytical task is often to locate where something appears inside a string.

This is exactly what POSITION() and its PostgreSQL-specific alias STRPOS() do.

They answer a simple but powerful question:

Where does a given substring start?


Why POSITION() / STRPOS() Matters in Analytics

In analytical workflows, knowing the position of a character or substring allows you to:

  • validate expected formats
  • detect malformed values
  • drive conditional logic
  • prepare for controlled extraction

Typical analytical scenarios include:

  • checking if a delimiter exists
  • identifying country codes in phone numbers
  • detecting annotations in categories
  • validating structured identifiers

POSITION() | Syntax Overview

POSITION(substring IN text)
  • returns the 1-based position of substring
  • returns 0 if the substring is not found

PostgreSQL also supports:

STRPOS(text, substring)

Both behave identically; only the argument order differs.
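
For example, both of these expressions return 4, the 1-based index of the first hyphen:

SELECT
  POSITION('-' IN '077-600-045') AS position_result,
  STRPOS('077-600-045', '-')     AS strpos_result;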


Example 1 | Detect Parentheses in Categories

Before removing annotations, we may want to detect which rows contain them.

SELECT
  category_raw,
  POSITION('(' IN category_raw) AS open_paren_pos
FROM customers_raw_text;

Output

category_raw        | open_paren_pos
Accessories (Promo) | 13
Electronics (Old)   | 13
Accessories         | 0
Electronics (Promo) | 13
Accessories (Test)  | 13

Analytical Interpretation

  • open_paren_pos > 0 → annotation exists
  • open_paren_pos = 0 → clean category

This allows you to:

  • flag rows for cleaning
  • apply conditional logic
  • measure data quality issues

Example 2 | STRPOS() for Phone Number Diagnostics

We can also use STRPOS() to detect formatting characters.

SELECT
  raw_phone,
  STRPOS(raw_phone, '-') AS hyphen_pos,
  STRPOS(raw_phone, '(') AS paren_pos
FROM customers_raw_text;

Output

raw_phone        | hyphen_pos | paren_pos
077600945        | 0          | 0
077-600-045      | 4          | 0
(374)-77-600-945 | 6          | 1
37477600945      | 0          | 0
77600945         | 0          | 0

Analytical Interpretation

This immediately tells us:

  • which rows contain separators
  • which rows include country codes
  • which rows are already clean
Note

This is often done before applying REGEXP_REPLACE().
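
This diagnose-then-clean flow can be sketched as a conditional projection: only rows where an annotation was detected are rewritten. The regex here is an assumption about the annotation format (a single parenthesized suffix).

SELECT
  category_raw,
  CASE
    WHEN POSITION('(' IN category_raw) > 0
      THEN TRIM(REGEXP_REPLACE(category_raw, '\(.*\)', '', 'g'))
    ELSE category_raw
  END AS category_clean
FROM customers_raw_text;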


POSITION() as a Validation Tool

A common analytical pattern is to convert position checks into flags.

SELECT
  customer_id,
  category_raw,
  POSITION('(' IN category_raw) > 0 AS has_annotation
FROM customers_raw_text;

Conceptual Output

customer_id | category_raw        | has_annotation
1           | Accessories (Promo) | true
2           | Electronics (Old)   | true
3           | Accessories         | false
4           | Electronics (Promo) | true
5           | Accessories (Test)  | true

This is useful for:

  • data quality reporting
  • audit checks
  • conditional cleaning logic
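
For instance, a data quality report can aggregate the flag into a single summary row (FILTER is PostgreSQL syntax):

SELECT
  COUNT(*) AS total_rows,
  COUNT(*) FILTER (WHERE POSITION('(' IN category_raw) > 0) AS annotated_rows
FROM customers_raw_text;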

POSITION() vs SUBSTRING()

Question        | Prefer
Where is it?    | POSITION() / STRPOS()
Extract it      | SUBSTRING()
Validate format | POSITION()
Clean text      | REGEXP_REPLACE()

Analytical Best Practice

  • use POSITION() to measure and detect
  • use SUBSTRING() to extract
  • use REGEXP_REPLACE() to clean
  • never assume structure without checking positions

Analytical Warning | Silent Assumptions

Assuming a substring exists without checking its position leads to:

  • incorrect extractions
  • unexpected NULLs
  • silent logic failures

Always validate structure before relying on it.

POSITION() and STRPOS() are small functions with high analytical value when used for validation and diagnostics.

SPLIT_PART()

After locating structure with POSITION() and before applying complex regex logic,
a very common analytical task is to split a string by a known delimiter.

SPLIT_PART() extracts a specific segment from a delimited text field.
It is simpler and safer than regex when the structure is known and consistent.


Why SPLIT_PART() Matters in Analytics

In real datasets, multiple values are often stored in a single column, separated by delimiters.

  • phone numbers (077-600-045)
  • composite IDs (ORD-2024-001)
  • hierarchical categories (Electronics-Mobile-Android)
  • version strings (v1.2.5)

From an analytical perspective:

  • grouping often requires one specific segment
  • joins depend on consistent extracted parts
  • validation depends on expected segment positions

SPLIT_PART() | Syntax

SPLIT_PART(text, delimiter, position)
  • text → input string
  • delimiter → character or string used to split
  • position → 1-based index of the part to return

If the requested position does not exist, SPLIT_PART() returns an empty string.


Example 1 | Split Phone Numbers by Hyphen (First Part)

Input

raw_phone
077-600-045
(374)-77-600-945
77600945


SELECT
  raw_phone,
  SPLIT_PART(raw_phone, '-', 1) AS first_part
FROM customers_raw_text;


Output

raw_phone        | first_part
077-600-045      | 077
(374)-77-600-945 | (374)
77600945         | 77600945

Example 2 | Split Phone Numbers by Hyphen (Second Part)

We can also extract the second segment.


SELECT
  raw_phone,
  SPLIT_PART(raw_phone, '-', 2) AS second_part
FROM customers_raw_text;


Output

raw_phone        | second_part
077-600-045      | 600
(374)-77-600-945 | 77
77600945         | (empty string)

Analytical Interpretation

  • empty result means the delimiter does not exist
  • empty string is not NULL
  • malformed or inconsistent rows are exposed immediately

This step is often used as a diagnostic, not just extraction.


Important Behavior to Remember

  • missing segments return an empty string
  • empty strings can silently affect:
    • grouping
    • counting
    • joins

Because of this, SPLIT_PART() is often paired with NULLIF().
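
A minimal sketch of that pairing: the empty string that SPLIT_PART() returns for delimiter-less rows is converted into a genuine NULL.

SELECT
  raw_phone,
  NULLIF(SPLIT_PART(raw_phone, '-', 2), '') AS second_part_or_null
FROM customers_raw_text;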


Example 3 | SPLIT_PART() After Cleaning

In practice, SPLIT_PART() is rarely used on raw data.


SELECT
  raw_phone,
  SPLIT_PART(
    REGEXP_REPLACE(raw_phone, '[^0-9-]', '', 'g'),
    '-',
    2
  ) AS clean_second_part
FROM customers_raw_text;

This combines:

  • pattern-based cleaning
  • delimiter-based extraction

SPLIT_PART() vs Other String Functions

Question                  | Prefer
Known delimiter           | SPLIT_PART()
Fixed character positions | SUBSTRING()
Variable formats          | REGEXP_REPLACE()
Structure validation      | POSITION()

Analytical Best Practice

  • validate delimiter presence with POSITION()
  • clean text before splitting
  • handle empty strings explicitly
  • never assume all rows share the same structure

Analytical Warning | Empty Strings

SPLIT_PART() returns empty strings, not NULL.

This can:

  • inflate COUNT(DISTINCT ...)
  • break joins
  • hide malformed records

Always validate results using:

  • LENGTH()
  • NULLIF()
  • COUNT(DISTINCT ...)
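
A validation query along these lines (column names as in the examples above; FILTER is PostgreSQL syntax) exposes how many rows lack the expected delimiter:

SELECT
  COUNT(*) AS total_rows,
  COUNT(*) FILTER (WHERE SPLIT_PART(raw_phone, '-', 2) = '') AS missing_second_part,
  COUNT(DISTINCT NULLIF(SPLIT_PART(raw_phone, '-', 2), '')) AS distinct_valid_parts
FROM customers_raw_text;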

NULLIF()

After handling extraction, cleaning, and construction of text values, the next important analytical task is preventing misleading values from entering metrics and logic.

NULLIF() is used to convert specific values into NULL intentionally.

It answers the question:

When should a value be treated as “missing”?


Why NULLIF() Matters in Analytics

In real datasets, missing information is often encoded as fake values, such as:

  • empty strings ''
  • placeholders like 'N/A', 'UNKNOWN'
  • default values like '0'

From an analytical perspective, these values:

  • distort counts
  • break averages
  • pollute grouping keys
  • hide true data quality issues

NULLIF() allows you to restore semantic meaning by converting such placeholders into NULL.


NULLIF() | Syntax Overview

NULLIF(value, comparison_value)

How it works:

  • if value = comparison_value → returns NULL
  • otherwise → returns value

Example 1 | Empty Strings to NULL

Assume some customers have missing last names stored as empty strings.

SELECT
  customer_id,
  last_name,
  NULLIF(last_name, '') AS last_name_clean
FROM customers_raw_text;

Conceptual Output

customer_id | last_name | last_name_clean
1           | doe       | doe
2           | smith     | smith
3           | johnson   | johnson
4           | VAN DYKE  | VAN DYKE
5           | (empty)   | NULL

Analytical Interpretation

  • empty strings are not meaningful values
  • converting them to NULL allows:
    • correct COUNT(last_name)
    • correct COUNT(DISTINCT last_name)
    • accurate completeness checks
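
The effect on counts can be verified directly: COUNT() skips NULL, so the two results below differ by exactly the number of empty-string last names.

SELECT
  COUNT(last_name)             AS count_including_empties,
  COUNT(NULLIF(last_name, '')) AS count_true_values
FROM customers_raw_text;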

Example 2 | Placeholder Text to NULL

Sometimes missing values are encoded explicitly.

SELECT
  category_raw,
  NULLIF(category_raw, 'UNKNOWN') AS category_clean
FROM customers_raw_text;

Example 3 | NULLIF() Before Aggregation

NULLIF() is often used before aggregation to prevent bad values from entering metrics.

SELECT
  COUNT(NULLIF(category_raw, 'UNKNOWN')) AS valid_categories
FROM customers_raw_text;

This ensures that placeholder values do not inflate counts.


NULLIF() with CONCAT() (Defensive Pattern)

When building labels, NULLIF() can prevent ugly outputs.

SELECT
  customer_id,
  CONCAT(
    first_name,
    ' ',
    NULLIF(last_name, '')
  ) AS full_name_safe
FROM customers_raw_text;

Without NULLIF(), empty strings may silently produce malformed labels.


NULLIF() vs COALESCE()

Goal                      | Use
Convert bad value to NULL | NULLIF()
Replace NULL with value   | COALESCE()
Restore missing semantics | NULLIF()
Impute missing values     | COALESCE()

These functions are often used together, but serve opposite purposes.
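
A common combined pattern, sketched here with a hypothetical 'Unknown' reporting label: NULLIF() first restores missing semantics, then COALESCE() imputes a display value.

SELECT
  customer_id,
  COALESCE(NULLIF(last_name, ''), 'Unknown') AS last_name_reported
FROM customers_raw_text;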


Analytical Best Practice

Before aggregation or joins:

  • identify placeholder values
  • convert them to NULL using NULLIF()
  • then apply aggregation or imputation logic

This keeps business meaning intact.


Analytical Warning | Silent Placeholders

Placeholder values are more dangerous than NULLs.

  • they look valid
  • they pass filters
  • they corrupt metrics silently

Always audit text fields for placeholder values
before trusting analytical results.

NULLIF() is a small function with high analytical impact, especially in data quality and KPI correctness.

LEFT() / RIGHT()

After working with positional extraction (SUBSTRING()) and pattern-based logic, we often encounter text fields with a fixed directional structure.

In such cases, extracting characters from the beginning or the end of a string is clearer and safer than calculating positions.

This is exactly what LEFT() and RIGHT() are designed for.


Why LEFT() / RIGHT() Matter in Analytics

Many analytical identifiers encode meaning at the edges of a string:

  • country or operator prefixes
  • regional or product codes
  • fixed-length identifiers
  • version or batch suffixes

When structure is stable, directional extraction is:

  • more readable
  • less error-prone
  • easier to audit

Syntax Overview

LEFT(text, n)
RIGHT(text, n)
  • text → input string
  • n → number of characters to extract
  • extraction starts strictly from the left or right

Example 1 | Extract Phone Prefix with LEFT()

Assume phone numbers have already been cleaned to digits only
and the first 3 digits represent a country or operator code.

SELECT
  raw_phone,
  LEFT(
    REGEXP_REPLACE(raw_phone, '[^0-9]', '', 'g'),
    3
  ) AS phone_prefix
FROM customers_raw_text;

Output

raw_phone        | phone_prefix
077600945        | 077
077-600-045      | 077
(374)-77-600-945 | 374
37477600945      | 374
77600945         | 776

Analytical Interpretation

  • prefixes reveal country or operator
  • enables routing or segmentation logic
  • supports validation of expected formats
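
For example, segmentation then becomes a simple GROUP BY on the extracted prefix:

SELECT
  LEFT(REGEXP_REPLACE(raw_phone, '[^0-9]', '', 'g'), 3) AS phone_prefix,
  COUNT(*) AS phone_count
FROM customers_raw_text
GROUP BY 1
ORDER BY phone_count DESC;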

Example 2 | Extract Core Identifier with RIGHT()

If the last 8 digits uniquely identify a phone number,
RIGHT() provides a clean and expressive solution.

SELECT
  raw_phone,
  RIGHT(
    REGEXP_REPLACE(raw_phone, '[^0-9]', '', 'g'),
    8
  ) AS phone_core
FROM customers_raw_text;

Output

raw_phone   | phone_core
077600945   | 77600945
37477600945 | 77600945
77600945    | 77600945

LEFT() / RIGHT() vs SUBSTRING()

Scenario               | Prefer
Fixed prefix or suffix | LEFT() / RIGHT()
Dynamic position       | SUBSTRING()
Readability            | LEFT() / RIGHT()
Complex extraction     | SUBSTRING()

Analytical Best Practice

  • apply cleaning before directional extraction
  • validate assumptions with LENGTH()
  • use LEFT()/RIGHT() only when format is guaranteed
  • document why a specific length was chosen
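
A length audit like the following, run before choosing n, shows whether the fixed-length assumption actually holds:

SELECT
  LENGTH(REGEXP_REPLACE(raw_phone, '[^0-9]', '', 'g')) AS digit_count,
  COUNT(*) AS rows_with_this_length
FROM customers_raw_text
GROUP BY 1
ORDER BY 1;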

Analytical Warning | Assumed Structure

Directional extraction assumes:

  • stable formatting
  • consistent string length
  • no missing segments

If upstream formats change, LEFT() and RIGHT() can silently produce incorrect results.

Always validate structure before relying on directional logic.

LEFT() and RIGHT() are simple functions, but when used correctly, they provide clear, auditable, and efficient extraction for fixed-format analytical fields.

Dummy Dataset | Messy Transaction Text Data

This dataset simulates realistic dirty text data commonly seen in transactional systems.
It will be used for both in-class exercises and the take-home case study.

The table intentionally contains:

  • inconsistent phone formats
  • annotated categories
  • mixed casing
  • formatting symbols
  • duplicate logical entities

Step 1 | Create Dummy Transaction Table

DROP TABLE IF EXISTS transactions_text_demo;

CREATE TABLE transactions_text_demo (
  transaction_id INTEGER,
  customer_id    INTEGER,
  raw_phone      TEXT,
  category_raw   TEXT,
  quantity       INTEGER,
  price          NUMERIC(10,2)
);

Step 2 | Populate with 1,000 Rows of Messy Data

INSERT INTO transactions_text_demo
SELECT
  gs AS transaction_id,
  (RANDOM() * 200)::INT + 1 AS customer_id,

  CASE (gs % 6)
    WHEN 0 THEN '   077600945  '
    WHEN 1 THEN '077-600-045'
    WHEN 2 THEN '(374)-77-600-945'
    WHEN 3 THEN '37477600945'
    WHEN 4 THEN '77600945'
    ELSE '077600945'
  END AS raw_phone,

  CASE (gs % 5)
    WHEN 0 THEN 'Accessories (Promo)'
    WHEN 1 THEN 'Accessories (Test)'
    WHEN 2 THEN 'Electronics (Old)'
    WHEN 3 THEN 'Electronics (Promo)'
    ELSE 'Accessories'
  END AS category_raw,

  (RANDOM() * 5)::INT + 1 AS quantity,
  (RANDOM() * 500 + 10)::NUMERIC(10,2) AS price
FROM generate_series(1, 1000) AS gs;

Step 3 | Sanity Check

SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT raw_phone) AS distinct_raw_phones,
  COUNT(DISTINCT category_raw) AS distinct_categories
FROM transactions_text_demo;

In-Class Assignment

Objective

Use measurement-first SQL to understand how dirty text affects grouping and counts.


Task 1 | Phone Number Diagnostics

Write a query that shows:

  • raw_phone
  • LENGTH(raw_phone)
  • position of '-'
  • position of '('
  • count of rows per pattern

Use:

  • LENGTH()
  • POSITION() or STRPOS()
  • GROUP BY

Task 2 | Category Fragmentation

Write a query that returns:

  • category_raw
  • number of transactions per category

Use:

  • GROUP BY category_raw
  • ORDER BY COUNT(*) DESC

Answer:

  • how many logical categories exist?
  • how many categories appear due to annotations?

Expected Insight

You should clearly see that:

  • grouping on raw text produces fragmented dimensions
  • phone numbers represent the same entity in multiple formats

Homework

Scenario

You are asked to produce reliable KPIs from transactions_text_demo.

Business questions:

  • revenue by category
  • number of unique customers
  • average transaction value

Part 1 | Profiling

Write SQL queries to assess:

  • phone number format diversity
  • category fragmentation
  • impact of dirty text on GROUP BY

Use only diagnostic functions.


Part 2 | Standardization Layer

Create a cleaned SELECT projection (no updates) that includes:

  • standardized phone number (last 8 digits)
  • cleaned category (no annotations, trimmed)
  • revenue per transaction

Use:

  • REGEXP_REPLACE()
  • TRIM()
  • SUBSTRING()
  • CONCAT() where relevant

Part 3 | KPI Comparison

Compute and compare:

  • revenue by raw category
  • revenue by cleaned category
  • unique customers (raw vs cleaned phone)

Use GROUP BY in all comparisons.


Part 4 | Analytical Explanation

Briefly explain:

  • why KPIs changed
  • which cleaning step had the biggest impact
  • what assumptions you made
  • what could silently break in production

Submission Rules

  • Create string_functions.sql
  • SQL only (no updates or deletes)
  • all cleaning must be done in SELECT queries
  • results must be reproducible
  • explanations must be concise and analytical
  • push it to GitHub

Important

This assignment tests analytical thinking, not just SQL syntax.

Dirty text → broken GROUP BY → misleading KPIs.