Session 06: DA with SQL | String Functions
Text Functions | General Overview
Text (string) functions are used to inspect, clean, standardize, and transform textual data at the row level.
Unlike numeric aggregate functions (SUM, AVG, COUNT), most text functions:
- operate per row
- return derived text columns
- do not reduce the number of rows on their own
In analytical workflows, text functions are typically applied before aggregation to ensure that grouping, counting, and summarization behave correctly.
Running Docker
Let's run Docker in detached mode:

```shell
docker compose up -d
```

Why Text Functions Matter in Analytics
Real-world text data is often dirty, inconsistent, and ambiguous.
Common issues include:
- inconsistent casing (`Electronics` vs `electronics`)
- leading or trailing spaces
- annotations inside text (`Category (Promo)`)
- mixed formats (`ID-123`, `ID_123`, `123`)
- missing or partially filled values
If these issues are not handled explicitly:
- `GROUP BY` produces fragmented categories
- `COUNT(DISTINCT ...)` overcounts entities
- KPIs become inconsistent across dashboards
- joins silently fail
Text functions are therefore data quality tools, not cosmetic helpers.
Typical Analytical Use Cases
Text functions are commonly used for:
- data cleaning
  - trimming spaces
  - normalizing case
  - removing annotations
- standardization
  - making categories comparable
  - ensuring consistent grouping keys
- feature engineering
  - extracting parts of strings
  - building derived text features
- validation
  - detecting malformed values
  - identifying unexpected formats
Analytical Principle
Text transformations change how rows are grouped.
Because of this:
- text cleaning is an analytical decision, not a purely technical step
- it must be done intentionally and consistently
Dummy Setup | Raw Phone Numbers and Names
To demonstrate text normalization beyond phone numbers, let's create a dummy table that covers the entire spectrum.
This allows us to later apply functions such as UPPER(), LOWER(), INITCAP(), and validation logic on names.
```sql
DROP TABLE IF EXISTS customers_raw_text;

CREATE TABLE customers_raw_text (
    customer_id  INTEGER,
    first_name   TEXT,
    last_name    TEXT,
    raw_phone    TEXT,
    category_raw TEXT,
    birth_date   DATE
);

INSERT INTO customers_raw_text (
    customer_id, first_name, last_name, raw_phone, category_raw, birth_date
) VALUES
    (1, 'joHN', 'doE',      ' 077600945 ',      'Accessories (Promo)', DATE '1994-03-12'),
    (2, 'MARY', 'sMiTh',    '077-600-045',      'Electronics (Old)',   DATE '1988-11-05'),
    (3, 'aLEx', 'johnSON',  '(374)-77-600-945', 'Accessories',         DATE '2001-07-23'),
    (4, 'anna', 'VAN DYKE', '37477600945',      'Electronics (Promo)', DATE '1999-01-30'),
    (5, NULL,   'brOwn',    '77600945',         'Accessories (Test)',  DATE '1994-03-12');
```

What this setup gives us
- phone numbers with inconsistent formatting
- names with:
  - random capitalization
  - mixed casing within the same value
  - multi-part last names
- unstructured categories with random capitalization
This mirrors real-world dirty text data and prepares us for the next set of string functions focused on standardization and normalization.
LENGTH()
Text fields such as phone numbers often arrive in many inconsistent formats.
Before attempting to clean or standardize them, a good analyst first tries to detect which rows are malformed.
The LENGTH() function is a simple but powerful diagnostic tool for this purpose.
Inspecting the Lengths
The first question to ask is:
Do all values have the same length?
```sql
SELECT
    raw_phone,
    LENGTH(raw_phone) AS phone_length
FROM customers_raw_text;
```
Output
| raw_phone | phone_length |
|---|---|
| ' 077600945 ' | 13 |
| '077-600-045' | 11 |
| '(374)-77-600-945' | 16 |
| '37477600945' | 11 |
| '77600945' | 8 |
Interpretation
Even before cleaning, we can already classify rows:
- 8 characters: likely already standardized → `77600945`
- 9 characters: likely a correct number with a leading zero → `077600945`
- 11 characters: likely contains a country code or separators
- 13+ characters: clearly contains whitespace and formatting symbols
At this point, the values still contain several kinds of noise:
- spaces
- hyphens
- parentheses
- country prefixes
Why This Step Is Critical
By using LENGTH() before cleaning, you can:
- detect formatting issues early
- identify multiple data quality patterns
- avoid applying a one-size-fits-all transformation
- design targeted cleaning rules
This avoids silent errors later when:
- grouping
- joining
- deduplicating
Analytical Insight
Notice that:
- `LENGTH()` alone already tells us which rows cannot possibly match the target
- the target `77600945` has a clear expected length
This gives us a validation rule we can enforce after cleaning.
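For example, a minimal sketch of such a rule (it borrows `TRIM()`, introduced in the next section, to ignore surrounding spaces; the 8-digit target is this session's assumption, not a universal standard):

```sql
-- Rows that cannot already be in the 8-digit target form
SELECT raw_phone
FROM customers_raw_text
WHERE LENGTH(TRIM(raw_phone)) <> 8;
```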
TRIM()
Text fields often contain invisible formatting noise, such as leading or trailing spaces.
Although these spaces are not obvious when visually inspecting data, they can:
- break joins
- fragment groups
- cause false duplicates
- invalidate validation rules
The TRIM() function is used to remove unnecessary whitespace from the beginning and end of text values.
Importantly, TRIM() removes noise, not meaning.
Why TRIM() Matters in Analytics
Whitespace has no business semantics.
That means:
- removing it does not change the underlying value
- but failing to remove it can distort analytical results
Because of this, TRIM() is considered a safe and low-risk transformation and is usually applied early in data-cleaning pipelines.
Applying TRIM()
Below we apply TRIM() to raw phone numbers and compare lengths before and after trimming.
```sql
SELECT
    raw_phone,
    LENGTH(raw_phone) AS length,
    TRIM(raw_phone) AS trimmed_phone,
    LENGTH(TRIM(raw_phone)) AS trimmed_length
FROM customers_raw_text;
```

Result Interpretation
From the output, we observe that:
- leading and trailing spaces are removed
- internal characters (digits, hyphens, parentheses) remain unchanged
- some rows move closer to the expected target length
- others still violate formatting rules
This confirms that:
- whitespace was only one of several issues
- further transformations are required for full standardization
Directional Variants
SQL also provides directional trimming functions:
```sql
LTRIM(raw_phone)  -- removes leading spaces only
RTRIM(raw_phone)  -- removes trailing spaces only
```

Using TRIM() is equivalent to applying both.
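As a side note, PostgreSQL's TRIM() can also strip an explicit character set, not just spaces; a small sketch:

```sql
-- BOTH ' -' strips any mix of spaces and hyphens from both ends only
SELECT TRIM(BOTH ' -' FROM '-077-600-045-') AS trimmed;
-- internal hyphens are untouched: the result is '077-600-045'
```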
Analytical Best Practice
Always follow this pattern:
- Measure (e.g., `LENGTH()`)
- Transform (e.g., `TRIM()`)
- Re-measure to validate the impact
This makes transformations:
- transparent
- auditable
- analytically defensible
- `TRIM()` removes invisible formatting noise
- it does not alter business meaning
- it prepares text data for structural cleaning
- it should almost always be applied before more aggressive text transformations
Whitespace issues are subtle, but ignoring them leads to silent analytical errors.
LOWER(), UPPER(), INITCAP()
Text values often differ only by capitalization, even though they represent the same entity.
For example:
- `john`, `John`, `JOHN`
- `van dyke`, `Van Dyke`, `VAN DYKE`
From an analytical perspective, these differences are purely cosmetic, but if left untreated they lead to:
- fragmented groups
- inflated `COUNT(DISTINCT ...)`
- failed joins
- inconsistent reporting
SQL provides several functions to normalize text case.
Why Case Normalization Matters
Capitalization has no business meaning, but SQL treats differently cased strings as different values.
That means:
- `'john' ≠ 'John' ≠ 'JOHN'`
- grouping becomes unreliable
- deduplication becomes inaccurate
Case normalization is therefore a data quality operation, not a formatting preference.
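The effect on grouping can be measured directly; a sketch comparing distinct counts before and after normalization on our dummy table:

```sql
SELECT
    COUNT(DISTINCT last_name)        AS raw_variants,
    COUNT(DISTINCT LOWER(last_name)) AS normalized_variants
FROM customers_raw_text;
-- raw_variants > normalized_variants means casing was fragmenting groups
```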
LOWER() | Normalize to Lowercase
LOWER() converts all characters in a string to lowercase.
This is the most common choice for:
- grouping keys
- joins
- deduplication
- categorical dimensions
```sql
SELECT
    first_name,
    LOWER(first_name) AS first_name_lower
FROM customers_raw_text;
```

When to Use
- internal identifiers
- grouping and joins
- text comparison logic
UPPER() | Normalize to Uppercase
UPPER() converts all characters in a string to uppercase.
This is often used for:
- codes
- abbreviations
- country or region identifiers
```sql
SELECT
    last_name,
    UPPER(last_name) AS last_name_upper
FROM customers_raw_text;
```

When to Use
- standardized codes
- reporting conventions
- legacy system alignment
INITCAP() | Proper Name Formatting
INITCAP() converts text to title case:
- first letter uppercase
- remaining letters lowercase
```sql
SELECT
    first_name,
    INITCAP(first_name) AS first_name_clean
FROM customers_raw_text;
```

Example Transformations
| raw value | INITCAP result |
|---|---|
| john | John |
| mARy | Mary |
| van dyke | Van Dyke |
Analytical Comparison
| Function | Best Use Case |
|---|---|
| LOWER() | joins, grouping, deduplication |
| UPPER() | codes, abbreviations |
| INITCAP() | presentation, names |
Analytical Best Practice
A common and robust pattern is:
- use `LOWER()` for keys and joins
- use `INITCAP()` for display purposes
- never mix raw and normalized text in analysis
Case normalization:
- changes how rows group
- affects counts and uniqueness
- must be applied consistently
Like TRIM(), these functions remove noise, not meaning.
REPLACE()
After handling whitespace (TRIM()) and capitalization (LOWER(), UPPER(), INITCAP()), the next common problem in real-world text data is structural noise inside strings.
Structural noise refers to characters that are part of visual formatting but carry no analytical meaning.
Typical examples include:
- hyphens in phone numbers
- spaces used as separators
- parentheses around country codes
- dots, slashes, or underscores
If these characters are not removed, text values that represent the same entity remain incomparable.
Why REPLACE() Matters in Analytics
REPLACE() allows you to explicitly remove or substitute known characters.
From an analytical perspective:
- the rule is deterministic
- the transformation is transparent
- the outcome is fully explainable
This makes REPLACE() suitable for early-stage structural cleanup, before applying more advanced pattern-based logic.
Basic Usage
The REPLACE() function has the following structure:
```sql
REPLACE(text, old_value, new_value)
```

It replaces all occurrences of `old_value` with `new_value`.
Removing Known Separators
Phone numbers often include hyphens for readability.
```sql
SELECT
    raw_phone,
    REPLACE(raw_phone, '-', '') AS phone_no_hyphen
FROM customers_raw_text;
```
| raw_phone | phone_no_hyphen |
|---|---|
| 077-600-045 | 077600045 |
| (374)-77-600-945 | (374)77600945 |
Only the hyphen is removed.
All other characters remain untouched.
Chaining REPLACE() Calls
When multiple unwanted characters are present, REPLACE() calls can be nested.
```sql
SELECT
    raw_phone,
    REPLACE(
        REPLACE(
            REPLACE(TRIM(raw_phone), '-', ''),
            '(', ''),
        ')', '') AS phone_partial_clean
FROM customers_raw_text;
```

This removes:
- leading and trailing spaces
- hyphens
- parentheses
Resulting Pattern
| raw_phone | phone_partial_clean |
|---|---|
| (374)-77-600-945 | 37477600945 |
| 077-600-045 | 077600045 |
At this stage, formatting noise is reduced, but the value is still not standardized.
Analytical Limitations of REPLACE()
While REPLACE() is useful, it has important limitations:
- it works only on explicit characters
- it does not understand patterns
- it cannot validate correctness
- rules must be hard-coded
For example:
- it cannot remove “everything that is not a digit”
- it cannot enforce length rules
- it cannot adapt to new formats
Because of this, REPLACE() is rarely the final cleaning step.
When to Use REPLACE()
Use REPLACE() when:
- the unwanted character is known
- formatting rules are simple
- transformations must be transparent
Avoid relying on REPLACE() when:
- formats vary widely
- patterns are unknown
- validation is required
Every REPLACE() call encodes an assumption:
- which characters are noise
- which characters are meaningful
- what happens if formats change
These assumptions must be:
- documented
- reviewed
- justified by business logic
REGEXP_REPLACE()
After handling explicit characters with REPLACE(), we move to pattern-based cleaning.
Real-world text often contains unknown or variable noise.
In such cases, hard-coded replacements do not scale.
REGEXP_REPLACE() allows you to define rules, not characters.
Why REGEXP_REPLACE() Matters in Analytics
Regular expressions let you describe what to keep or what to remove, rather than listing every unwanted symbol.
From an analytical perspective, this means:
- fewer hard-coded assumptions
- better generalization to unseen formats
- cleaner, more robust standardization rules
This is especially important for identifiers such as:
- phone numbers
- IDs
- codes
- reference keys
REGEXP_REPLACE() | Syntax Overview
```sql
REGEXP_REPLACE(text, pattern, replacement [, flags])
```

Key components:
- text → the input string
- pattern → a regular expression
- replacement → what to substitute
- flags → optional modifiers (e.g. global replacement)
Extracting Digits Only
A very common analytical task is to remove everything except digits.
The regex pattern:

```
[^0-9]
```

Pattern breakdown:
- `[]` → character set
- `^` → NOT (inside the set)
- `0-9` → digits
So the pattern matches any non-digit character.
Query | Digits-Only Cleaning
Let's run the code below:

```sql
SELECT
    raw_phone,
    REGEXP_REPLACE(raw_phone, '[^0-9]', '', 'g') AS digits_only
FROM customers_raw_text;
```

- `'[^0-9]'` → the regular expression pattern that defines what to match
- `''` → the replacement (an empty string, i.e. delete the match)
- `'g'` → the global flag; without `'g'`, only the first match would be replaced
Feel free to experiment as much as you need!
Output | Conceptual Output
The above query would return
| raw_phone | digits_only |
|---|---|
| ' 077600945 ' | 077600945 |
| '077-600-045' | 077600045 |
| '(374)-77-600-945' | 37477600945 |
| '37477600945' | 37477600945 |
| '77600945' | 77600945 |

- all formatting symbols are removed
- only numeric content remains
- lengths still vary
The goal here is not to memorize regex syntax, but to recognize reusable analytical patterns.
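As a sketch of such a reusable pattern, cleaning and a length check can be combined in one pass (the allowed lengths 8, 9, and 11 are an assumed business rule for this dataset, not a general truth):

```sql
SELECT
    raw_phone,
    REGEXP_REPLACE(raw_phone, '[^0-9]', '', 'g') AS digits_only,
    LENGTH(REGEXP_REPLACE(raw_phone, '[^0-9]', '', 'g')) IN (8, 9, 11)
        AS plausible_length  -- assumed rule for this dataset
FROM customers_raw_text;
```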
Removing Annotations Inside Parentheses
Text fields often contain annotations or comments that should not be part of the analytical key
Recall our dummy table:
| customer_id | raw_phone | category_raw | birth_date |
|---|---|---|---|
| 1 | ' 077600945 ' | Accessories (Promo) | 1994-03-12 |
| 2 | '077-600-045' | Electronics (Old) | 1988-11-05 |
| 3 | '(374)-77-600-945' | Accessories | 2001-07-23 |
| 4 | '37477600945' | Electronics (Promo) | 1999-01-30 |
| 5 | '77600945' | Accessories (Test) | 1994-03-12 |
These annotations inside parentheses are useful for humans but harmful for analytics: they usually break grouping and counting.
Examples:
- `Accessories (Promo)`
- `Electronics (Old)`
- `Gold Customer (Test)`
From an analytical standpoint, these annotations:
- fragment categories
- break grouping logic
- inflate distinct counts
Regex Pattern 1 | Remove Everything Inside ()
The following pattern removes:
- the opening parenthesis `(`
- everything inside
- the closing parenthesis `)`

```
\([^)]*\)
```

Pattern breakdown:
- `\(` → literal opening parenthesis
- `[^)]*` → any characters except `)` (zero or more)
- `\)` → literal closing parenthesis
Query | Cleaning Categories
```sql
SELECT
    category_raw,
    REGEXP_REPLACE(category_raw, '\([^)]*\)', '', 'g')
        AS category_clean1
FROM customers_raw_text;
```

Output | Cleaning Categories
| category_raw | category_clean1 |
|---|---|
| Accessories (Promo) | Accessories |
| Electronics (Old) | Electronics |
| Accessories | Accessories |
| Electronics (Promo) | Electronics |
| Accessories (Test) | Accessories |
Note the trailing whitespace left behind in the cleaned values!
Regex Pattern 2 | Remove Parentheses Content + TRIM Whitespaces
After removing annotations inside parentheses, trailing whitespaces often remain.
This leads to categories that look identical but fail equality checks.
To fix this, we chain TRIM() after REGEXP_REPLACE().
```sql
TRIM(REGEXP_REPLACE(category_raw, '\([^)]*\)', '', 'g'))
```

What this does:
- removes everything inside `()`
- removes the parentheses themselves
- trims leading and trailing spaces
Query | Cleaning Categories + TRIM
```sql
SELECT
    category_raw,
    TRIM(
        REGEXP_REPLACE(category_raw, '\([^)]*\)', '', 'g')
    ) AS category_clean2
FROM customers_raw_text;
```

Output | Cleaning Categories (After TRIM)
| category_raw | category_clean2 |
|---|---|
| Accessories (Promo) | Accessories |
| Electronics (Old) | Electronics |
| Accessories | Accessories |
| Electronics (Promo) | Electronics |
| Accessories (Test) | Accessories |
Without TRIM(), values like 'Accessories ' and 'Accessories'
would be treated as different categories during grouping.
Regex Pattern 3 | Remove Parentheses + Leading Space in One Step
Instead of chaining functions, we can encode whitespace logic directly into the regex.
```
\s*\(.*?\)
```

Pattern breakdown:
- `\s*` → zero or more whitespace characters
- `\(` → opening parenthesis
- `.*?` → any characters (non-greedy)
- `\)` → closing parenthesis
This pattern removes:
- the parentheses
- the annotation
- any space before the parenthesis
Query | Single-Step Category Cleaning
```sql
SELECT
    category_raw,
    REGEXP_REPLACE(category_raw, '\s*\(.*?\)', '', 'g')
        AS category_clean3
FROM customers_raw_text;
```

Output | Cleaning Categories (Single Regex)
| category_raw | category_clean3 |
|---|---|
| Accessories (Promo) | Accessories |
| Electronics (Old) | Electronics |
| Accessories | Accessories |
| Electronics (Promo) | Electronics |
| Accessories (Test) | Accessories |
Comparison Summary
| Approach | Result | Notes |
|---|---|---|
| Pattern 1 | Accessories | Leaves trailing spaces |
| Pattern 2 | Accessories | Safe and explicit |
| Pattern 3 | Accessories | Compact, regex-heavy |
- go over Pattern 1 when first learning the mechanics
- prefer Pattern 2 when teaching or collaborating
- prefer Pattern 3 when rules are stable and well-documented
- always validate results with `COUNT(DISTINCT ...)` after cleaning
Regex-based cleaning directly impacts grouping, KPIs, and dashboards.
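A quick validation in that spirit, using Pattern 2 (a sketch; on the dummy table the distinct count should drop once annotations are removed):

```sql
SELECT
    COUNT(DISTINCT category_raw) AS distinct_before,
    COUNT(DISTINCT TRIM(REGEXP_REPLACE(category_raw, '\([^)]*\)', '', 'g')))
        AS distinct_after
FROM customers_raw_text;
```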
REGEXP vs REPLACE | When to Choose Which
| Scenario | Prefer |
|---|---|
| Known characters only | REPLACE() |
| Variable formatting | REGEXP_REPLACE() |
| Validation and extraction | REGEXP_REPLACE() |
| Transparency and simplicity | REPLACE() |
Analytical Warning
Regular expressions are powerful but dangerous if misused.
- overly broad patterns can remove valid information
- unreadable regex becomes technical debt
- rules must be documented and tested
Always combine regex cleaning with measurement and validation.
SUBSTRING()
After cleaning whitespace, case, and structural noise, the next common analytical task is extracting meaningful parts of text.
SUBSTRING() allows you to select a specific portion of a string based on position or pattern.
This is essential when a single text field contains multiple pieces of information.
Unlike REPLACE() or REGEXP_REPLACE(), which remove noise,
SUBSTRING() is used to isolate signal.
Why SUBSTRING() Matters in Analytics
Many real-world text fields are compound fields, for example:
- phone numbers with prefixes
- codes with embedded meaning
- identifiers with fixed segments
- dates or versions encoded in strings
From an analytical perspective, this means:
- grouping requires extracted keys
- joins depend on consistent substrings
- validation depends on expected positions
SUBSTRING() | Basic Positional Extraction
The most common form of SUBSTRING() uses starting position and length.
```sql
SUBSTRING(text FROM start_position FOR length)
```

Example 1 | Extract Last 8 Digits of Phone Numbers
Assume we have already removed non-digit characters and want to extract the core phone number.
Input
| raw_phone |
|---|
| 077600945 |
| 37477600945 |
| 77600945 |
```sql
SELECT
    raw_phone,
    SUBSTRING(raw_phone FROM LENGTH(raw_phone) - 7 FOR 8) AS phone_core
FROM customers_raw_text;
```
Output
| raw_phone | phone_core |
|---|---|
| 077600945 | 77600945 |
| 37477600945 | 77600945 |
| 77600945 | 77600945 |
Analytical Interpretation
This logic:
- assumes the last 8 digits uniquely identify the phone number
- removes country codes and prefixes
- creates a standardized join key
This pattern is extremely common in:
- telecom analytics
- customer deduplication
- identity resolution
Example 2 | SUBSTRING() with Fixed Structure
When text follows a known fixed format, SUBSTRING() is deterministic and fast.
```sql
SELECT
    category_raw,
    SUBSTRING(category_raw FROM 1 FOR 11) AS category_prefix
FROM customers_raw_text;
```
| category_raw | category_prefix |
|---|---|
| Accessories (Promo) | Accessories |
| Electronics (Old) | Electronics |
| Accessories | Accessories |
| Electronics (Promo) | Electronics |
| Accessories (Test) | Accessories |
This is useful when:
- naming conventions are enforced
- formats are guaranteed upstream
Example 3 | Extract Digits Using Regex
PostgreSQL also supports regex-based substring extraction.
```sql
SUBSTRING(text FROM pattern)
```

```sql
SELECT
    raw_phone,
    SUBSTRING(raw_phone FROM '[0-9]+') AS first_digit_sequence
FROM customers_raw_text;
```

This extracts the first continuous digit block.
| raw_phone | first_digit_sequence |
|---|---|
| ' 077600945 ' | 077600945 |
| '077-600-045' | 077 |
| '(374)-77-600-945' | 374 |
| '37477600945' | 37477600945 |
| '77600945' | 77600945 |
Positional vs Pattern-Based SUBSTRING()
| Use Case | Prefer |
|---|---|
| Fixed-length identifiers | Positional |
| Variable formats | Regex-based |
| Performance-critical logic | Positional |
| Unknown structure | Regex-based |
Analytical Best Practice
- use positional SUBSTRING() when formats are stable
- use regex SUBSTRING() when formats vary
- always validate results with `LENGTH()` and `COUNT(DISTINCT ...)`
Every SUBSTRING() encodes assumptions about:
- string length
- format stability
- semantic meaning of positions
If formats change upstream, SUBSTRING() can silently break logic.
Always document:
- why a position was chosen
- what happens if length changes
SUBSTRING() is a powerful extraction tool, but only when its assumptions are explicit and validated.
CONCAT() and ||
After extracting and cleaning text (TRIM(), LOWER(), INITCAP(), SUBSTRING()), a very common analytical task is to construct new text values from multiple columns.
This is where CONCAT() and the string concatenation operator || are used.
They allow you to build full identifiers, labels, and display fields from atomic components.
Why CONCAT() Matters in Analytics
In real analytical datasets, information is often split across columns:
- first name + last name
- city + zip code
- category + subcategory
- year + quarter
From an analytical perspective, this means:
- reporting often requires combined fields
- joins may rely on constructed keys
- dashboards need human-readable labels
Concatenation is therefore a feature engineering operation, not just formatting.
CONCAT() | Syntax Overview
```sql
CONCAT(value1, value2, ..., valueN)
```

Key properties:
- accepts multiple arguments
- treats `NULL` as an empty string
- always returns text

PostgreSQL also supports the operator form:

```sql
value1 || value2
```

Example 1 | Building Full Names (Raw)
Assume we have separate first_name and last_name columns.
```sql
SELECT
    customer_id,
    first_name,
    last_name,
    CONCAT(first_name, ' ', last_name) AS full_name
FROM customers_raw_text;
```

Output
| customer_id | first_name | last_name | full_name |
|---|---|---|---|
| 1 | joHN | doE | joHN doE |
| 2 | MARY | sMiTh | MARY sMiTh |
| 3 | aLEx | johnSON | aLEx johnSON |
| 4 | anna | VAN DYKE | anna VAN DYKE |
| 5 | NULL | brOwn | brOwn |

Note that the raw casing carries straight through, and in row 5 only the NULL is dropped (the separator space still precedes the surname).
Example 2 | CONCAT() + INITCAP() (Recommended Pattern)
A very common and safe analytical pattern is:
- normalize case
- then concatenate
```sql
SELECT
    customer_id,
    INITCAP(first_name) AS first_name_clean,
    INITCAP(last_name) AS last_name_clean,
    INITCAP(CONCAT(
        first_name,
        ' ',
        last_name
    )) AS full_name_clean
FROM customers_raw_text;
```

Output
| customer_id | first_name_clean | last_name_clean | full_name_clean |
|---|---|---|---|
| 1 | John | Doe | John Doe |
| 2 | Mary | Smith | Mary Smith |
| 3 | Alex | Johnson | Alex Johnson |
| 4 | Anna | Van Dyke | Anna Van Dyke |
| 5 | NULL | Brown | Brown |
CONCAT() vs ||
The same result can be achieved using the operator form.
```sql
SELECT
    customer_id,
    INITCAP(first_name) || ' ' || INITCAP(last_name) AS full_name_clean
FROM customers_raw_text;
```

Key Difference
- `CONCAT()` treats `NULL` as an empty string
- `||` returns `NULL` if any operand is `NULL`
This difference is analytically important.
- use `CONCAT()` when `NULL` values are possible
- use `||` when you want `NULL` propagation
- always normalize text before concatenation
- validate results with `COUNT(DISTINCT ...)`
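The NULL difference can be verified with a one-row probe; a minimal sketch:

```sql
SELECT
    CONCAT('john', ' ', NULL) AS concat_result,  -- 'john ' (NULL treated as empty)
    'john' || ' ' || NULL     AS pipe_result;    -- NULL (NULL propagates)
```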
Concatenated fields:
- often look harmless
- but frequently become join keys or grouping dimensions
If constructed inconsistently, they lead to:
- duplicate entities
- broken joins
- misleading KPIs
Always document:
- construction logic
- ordering
- casing rules
CONCAT() is a simple function, but it plays a critical role in building reliable analytical dimensions.
POSITION() / STRPOS()
The next analytical task is often to locate where something appears inside a string.
This is exactly what POSITION() and its PostgreSQL-specific alias STRPOS() do.
They answer a simple but powerful question:
Where does a given substring start?
Why POSITION() / STRPOS() Matters in Analytics
In analytical workflows, knowing the position of a character or substring allows you to:
- validate expected formats
- detect malformed values
- drive conditional logic
- prepare for controlled extraction
Typical analytical scenarios include:
- checking if a delimiter exists
- identifying country codes in phone numbers
- detecting annotations in categories
- validating structured identifiers
POSITION() | Syntax Overview
```sql
POSITION(substring IN text)
```

- returns the 1-based position of `substring`
- returns `0` if the substring is not found

PostgreSQL also supports:

```sql
STRPOS(text, substring)
```

Both behave identically.
Example 1 | Detect Parentheses in Categories
Before removing annotations, we may want to detect which rows contain them.
```sql
SELECT
    category_raw,
    POSITION('(' IN category_raw) AS open_paren_pos
FROM customers_raw_text;
```

Output
| category_raw | open_paren_pos |
|---|---|
| Accessories (Promo) | 13 |
| Electronics (Old) | 13 |
| Accessories | 0 |
| Electronics (Promo) | 13 |
| Accessories (Test) | 13 |
Analytical Interpretation
- `open_paren_pos > 0` → annotation exists
- `open_paren_pos = 0` → clean category
This allows you to:
- flag rows for cleaning
- apply conditional logic
- measure data quality issues
Example 2 | STRPOS() for Phone Number Diagnostics
We can also use STRPOS() to detect formatting characters.
```sql
SELECT
    raw_phone,
    STRPOS(raw_phone, '-') AS hyphen_pos,
    STRPOS(raw_phone, '(') AS paren_pos
FROM customers_raw_text;
```

Output
| raw_phone | hyphen_pos | paren_pos |
|---|---|---|
| ' 077600945 ' | 0 | 0 |
| '077-600-045' | 4 | 0 |
| '(374)-77-600-945' | 6 | 1 |
| '37477600945' | 0 | 0 |
| '77600945' | 0 | 0 |
Analytical Interpretation
This immediately tells us:
- which rows contain separators
- which rows include country codes
- which rows are already clean
This is often done before applying REGEXP_REPLACE().
POSITION() as a Validation Tool
A common analytical pattern is to convert position checks into flags.
```sql
SELECT
    customer_id,
    category_raw,
    POSITION('(' IN category_raw) > 0 AS has_annotation
FROM customers_raw_text;
```

Conceptual Output
| customer_id | category_raw | has_annotation |
|---|---|---|
| 1 | Accessories (Promo) | true |
| 2 | Electronics (Old) | true |
| 3 | Accessories | false |
| 4 | Electronics (Promo) | true |
| 5 | Accessories (Test) | true |
This is useful for:
- data quality reporting
- audit checks
- conditional cleaning logic
POSITION() vs SUBSTRING()
| Question | Prefer |
|---|---|
| Where is it? | POSITION() / STRPOS() |
| Extract it | SUBSTRING() |
| Validate format | POSITION() |
| Clean text | REGEXP_REPLACE() |
Analytical Best Practice
- use `POSITION()` to measure and detect
- use `SUBSTRING()` to extract
- use `REGEXP_REPLACE()` to clean
- never assume structure without checking positions
Assuming a substring exists without checking its position leads to:
- incorrect extractions
- unexpected NULLs
- silent logic failures
Always validate structure before relying on it.
POSITION() and STRPOS() are small functions with high analytical value when used for validation and diagnostics.
SPLIT_PART()
After locating structure with POSITION() and before applying complex regex logic,
a very common analytical task is to split a string by a known delimiter.
SPLIT_PART() extracts a specific segment from a delimited text field.
It is simpler and safer than regex when the structure is known and consistent.
Why SPLIT_PART() Matters in Analytics
In real datasets, multiple values are often stored in a single column, separated by delimiters.
- phone numbers (`077-600-045`)
- composite IDs (`ORD-2024-001`)
- hierarchical categories (`Electronics-Mobile-Android`)
- version strings (`v1.2.5`)
From an analytical perspective:
- grouping often requires one specific segment
- joins depend on consistent extracted parts
- validation depends on expected segment positions
SPLIT_PART() | Syntax
```sql
SPLIT_PART(text, delimiter, position)
```

- `text` → the input string
- `delimiter` → the character or string used to split
- `position` → the 1-based index of the part to return
If the requested position does not exist, SPLIT_PART() returns an empty string.
Example 1 | Split Phone Numbers by Hyphen (First Part)
Input

| raw_phone |
|---|
| 077-600-045 |
| (374)-77-600-945 |
| 77600945 |

```sql
SELECT
    raw_phone,
    SPLIT_PART(raw_phone, '-', 1) AS first_part
FROM customers_raw_text;
```
Output
| raw_phone | first_part |
|---|---|
| 077-600-045 | 077 |
| (374)-77-600-945 | (374) |
| 77600945 | 77600945 |
Example 2 | Split Phone Numbers by Hyphen (Second Part)
We can also extract the second segment.
```sql
SELECT
    raw_phone,
    SPLIT_PART(raw_phone, '-', 2) AS second_part
FROM customers_raw_text;
```
Output
| raw_phone | second_part |
|---|---|
| 077-600-045 | 600 |
| (374)-77-600-945 | 77 |
| 77600945 | |
Analytical Interpretation
- empty result means the delimiter does not exist
- empty string is not NULL
- malformed or inconsistent rows are exposed immediately
This step is often used as a diagnostic, not just extraction.
Important Behavior to Remember
- missing segments return an empty string
- empty strings can silently affect:
- grouping
- counting
- joins
Because of this, SPLIT_PART() is often paired with NULLIF().
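The pairing can look like this (a sketch on our dummy table):

```sql
SELECT
    raw_phone,
    NULLIF(SPLIT_PART(raw_phone, '-', 2), '') AS second_part_or_null
FROM customers_raw_text;
-- rows without a hyphen now yield NULL instead of a misleading empty string
```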
Example 3 | SPLIT_PART() After Cleaning
In practice, SPLIT_PART() is rarely used on raw data.
```sql
SELECT
    raw_phone,
    SPLIT_PART(
        REGEXP_REPLACE(raw_phone, '[^0-9-]', '', 'g'),
        '-',
        2
    ) AS clean_second_part
FROM customers_raw_text;
```

This combines:
- pattern-based cleaning
- delimiter-based extraction
SPLIT_PART() vs Other String Functions
| Question | Prefer |
|---|---|
| Known delimiter | SPLIT_PART() |
| Fixed character positions | SUBSTRING() |
| Variable formats | REGEXP_REPLACE() |
| Structure validation | POSITION() |
Analytical Best Practice
- validate delimiter presence with `POSITION()`
- clean text before splitting
- handle empty strings explicitly
- never assume all rows share the same structure
SPLIT_PART() returns empty strings, not NULL.
This can:
- inflate `COUNT(DISTINCT ...)`
- break joins
- hide malformed records

Always validate results using:
- `LENGTH()`
- `NULLIF()`
- `COUNT(DISTINCT ...)`
NULLIF()
After handling extraction, cleaning, and construction of text values, the next important analytical task is preventing misleading values from entering metrics and logic.
NULLIF() is used to convert specific values into NULL intentionally.
It answers the question:
When should a value be treated as “missing”?
Why NULLIF() Matters in Analytics
In real datasets, missing information is often encoded as fake values, such as:
- empty strings `''`
- placeholders like `'N/A'`, `'UNKNOWN'`
- default values like `'0'`
From an analytical perspective, these values:
- distort counts
- break averages
- pollute grouping keys
- hide true data quality issues
NULLIF() allows you to restore semantic meaning by converting such placeholders into NULL.
NULLIF() | Syntax Overview
```sql
NULLIF(value, comparison_value)
```

How it works:
- if `value = comparison_value` → returns `NULL`
- otherwise → returns `value`
Example 1 | Empty Strings to NULL
Assume some customers have missing last names stored as empty strings.
```sql
SELECT
    customer_id,
    last_name,
    NULLIF(last_name, '') AS last_name_clean
FROM customers_raw_text;
```

Conceptual Output
| customer_id | last_name | last_name_clean |
|---|---|---|
| 1 | doe | doe |
| 2 | smith | smith |
| 3 | johnson | johnson |
| 4 | VAN DYKE | VAN DYKE |
| 5 | | NULL |
Analytical Interpretation
- empty strings are not meaningful values
- converting them to NULL allows:
  - correct COUNT(last_name)
  - correct COUNT(DISTINCT last_name)
  - accurate completeness checks
Example 2 | Placeholder Text to NULL
Sometimes missing values are encoded explicitly.
```sql
SELECT
    category_raw,
    NULLIF(category_raw, 'UNKNOWN') AS category_clean
FROM customers_raw_text;
```

Example 3 | NULLIF() Before Aggregation
NULLIF() is often used before aggregation to prevent bad values from entering metrics.
```sql
SELECT
    COUNT(NULLIF(category_raw, 'UNKNOWN')) AS valid_categories
FROM customers_raw_text;
```

This ensures that placeholder values do not inflate counts.
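The same pattern protects averages. A hedged sketch, assuming hypothetically that 0 is used as a "missing" placeholder in a quantity column:

```sql
-- Hypothetical: 0 encodes "missing quantity".
-- NULLIF keeps those rows out of the average entirely.
SELECT
    AVG(quantity)            AS avg_with_placeholders,
    AVG(NULLIF(quantity, 0)) AS avg_without_placeholders
FROM transactions_text_demo;
```

Because AVG() ignores NULLs, the second metric reflects only genuinely recorded quantities.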
NULLIF() with CONCAT() (Defensive Pattern)
When building labels, NULLIF() can prevent ugly outputs.
```sql
SELECT
    customer_id,
    CONCAT(
        first_name,
        ' ',
        NULLIF(last_name, '')
    ) AS full_name_safe
FROM customers_raw_text;
```

Without NULLIF(), empty strings may silently produce malformed labels.
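One caveat: in PostgreSQL, CONCAT() treats NULL arguments as empty strings, so a customer with no last name still gets a trailing space. A sketch of a slightly more defensive variant wraps the result in TRIM():

```sql
-- CONCAT() ignores NULL arguments in PostgreSQL, so the NULL
-- produced by NULLIF() leaves a trailing space; TRIM() removes it.
SELECT
    customer_id,
    TRIM(CONCAT(first_name, ' ', NULLIF(last_name, ''))) AS full_name_safe
FROM customers_raw_text;
```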
NULLIF() vs COALESCE()
| Goal | Use |
|---|---|
| Convert bad value to NULL | NULLIF() |
| Replace NULL with value | COALESCE() |
| Restore missing semantics | NULLIF() |
| Impute missing values | COALESCE() |
These functions are often used together, but serve opposite purposes.
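A sketch of the combined pattern, where 'Uncategorized' is an assumed business label rather than anything defined in this dataset:

```sql
-- NULLIF restores missing semantics,
-- COALESCE then imputes an explicit label for reporting.
SELECT
    COALESCE(NULLIF(category_raw, 'UNKNOWN'), 'Uncategorized') AS category_final
FROM customers_raw_text;
```

The order matters: NULLIF runs first to expose the missing value, COALESCE runs second to decide what to show in its place.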
Analytical Best Practice
Before aggregation or joins:
- identify placeholder values
- convert them to NULL using NULLIF()
- then apply aggregation or imputation logic
This keeps business meaning intact.
Placeholder values are more dangerous than NULLs.
- they look valid
- they pass filters
- they corrupt metrics silently
Always audit text fields for placeholder values
before trusting analytical results.
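A minimal audit sketch; the placeholder list here is an assumption and should be adapted to each dataset:

```sql
-- Quick audit: how often do common placeholder values appear?
SELECT
    category_raw,
    COUNT(*) AS occurrences
FROM customers_raw_text
WHERE category_raw IN ('', 'N/A', 'UNKNOWN', '0')
GROUP BY category_raw
ORDER BY occurrences DESC;
```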
NULLIF() is a small function with high analytical impact, especially in data quality and KPI correctness.
LEFT() / RIGHT()
After working with positional extraction (SUBSTRING()) and pattern-based logic, we often encounter text fields with a fixed directional structure.
In such cases, extracting characters from the beginning or the end of a string is clearer and safer than calculating positions.
This is exactly what LEFT() and RIGHT() are designed for.
Why LEFT() / RIGHT() Matter in Analytics
Many analytical identifiers encode meaning at the edges of a string:
- country or operator prefixes
- regional or product codes
- fixed-length identifiers
- version or batch suffixes
When structure is stable, directional extraction is:
- more readable
- less error-prone
- easier to audit
Syntax Overview
```sql
LEFT(text, n)
RIGHT(text, n)
```

- text → input string
- n → number of characters to extract
- extraction starts strictly from the left or right
Example 1 | Extract Phone Prefix with LEFT()
Assume phone numbers are first cleaned to digits only,
and that the first 3 digits then represent a country or operator code.
```sql
SELECT
    raw_phone,
    LEFT(
        REGEXP_REPLACE(raw_phone, '[^0-9]', '', 'g'),
        3
    ) AS phone_prefix
FROM customers_raw_text;
```

Output
| raw_phone | phone_prefix |
|---|---|
| 077600945 | 077 |
| 077-600-045 | 077 |
| (374)-77-600-945 | 374 |
| 37477600945 | 374 |
| 77600945 | 776 |
Analytical Interpretation
- prefixes reveal country or operator
- enables routing or segmentation logic
- supports validation of expected formats
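For segmentation, the extracted prefix can feed directly into a GROUP BY. A sketch against the same table:

```sql
-- Segment rows by extracted prefix; the counts per prefix
-- also reveal format diversity at a glance.
SELECT
    LEFT(REGEXP_REPLACE(raw_phone, '[^0-9]', '', 'g'), 3) AS phone_prefix,
    COUNT(*) AS rows_per_prefix
FROM customers_raw_text
GROUP BY phone_prefix
ORDER BY rows_per_prefix DESC;
```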
Example 2 | Extract Core Identifier with RIGHT()
If the last 8 digits uniquely identify a phone number,
RIGHT() provides a clean and expressive solution.
```sql
SELECT
    raw_phone,
    RIGHT(
        REGEXP_REPLACE(raw_phone, '[^0-9]', '', 'g'),
        8
    ) AS phone_core
FROM customers_raw_text;
```

Output
| raw_phone | phone_core |
|---|---|
| 077600945 | 77600945 |
| 37477600945 | 77600945 |
| 77600945 | 77600945 |
LEFT() / RIGHT() vs SUBSTRING()
| Scenario | Prefer |
|---|---|
| Fixed prefix or suffix | LEFT() / RIGHT() |
| Dynamic position | SUBSTRING() |
| Readability | LEFT() / RIGHT() |
| Complex extraction | SUBSTRING() |
Analytical Best Practice
- apply cleaning before directional extraction
- validate assumptions with LENGTH()
- use LEFT()/RIGHT() only when format is guaranteed
- document why a specific length was chosen
Directional extraction assumes:
- stable formatting
- consistent string length
- no missing segments
If upstream formats change, LEFT() and RIGHT() can silently produce incorrect results.
Always validate structure before relying on directional logic.
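A sketch of such a structure check before trusting RIGHT(..., 8): profile the digit lengths that actually occur.

```sql
-- How long are the cleaned phone numbers really?
-- Any length below 8 means RIGHT(..., 8) would return a short fragment.
SELECT
    LENGTH(REGEXP_REPLACE(raw_phone, '[^0-9]', '', 'g')) AS digits,
    COUNT(*) AS rows_with_this_length
FROM customers_raw_text
GROUP BY digits
ORDER BY digits;
```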
LEFT() and RIGHT() are simple functions, but when used correctly, they provide clear, auditable, and efficient extraction for fixed-format analytical fields.
Dummy Dataset | Messy Transaction Text Data
This dataset simulates realistic dirty text data commonly seen in transactional systems.
It will be used for both in-class exercises and the take-home case study.
The table intentionally contains:
- inconsistent phone formats
- annotated categories
- mixed casing
- formatting symbols
- duplicate logical entities
Step 1 | Create Dummy Transaction Table
```sql
DROP TABLE IF EXISTS transactions_text_demo;
CREATE TABLE transactions_text_demo (
    transaction_id INTEGER,
    customer_id INTEGER,
    raw_phone TEXT,
    category_raw TEXT,
    quantity INTEGER,
    price NUMERIC(10,2)
);
```

Step 2 | Populate with 1,000 Rows of Messy Data
```sql
INSERT INTO transactions_text_demo
SELECT
    gs AS transaction_id,
    (RANDOM() * 200)::INT + 1 AS customer_id,
    CASE (gs % 6)
        WHEN 0 THEN ' 077600945 '
        WHEN 1 THEN '077-600-045'
        WHEN 2 THEN '(374)-77-600-945'
        WHEN 3 THEN '37477600945'
        WHEN 4 THEN '77600945'
        ELSE '077600945'
    END AS raw_phone,
    CASE (gs % 5)
        WHEN 0 THEN 'Accessories (Promo)'
        WHEN 1 THEN 'Accessories (Test)'
        WHEN 2 THEN 'Electronics (Old)'
        WHEN 3 THEN 'Electronics (Promo)'
        ELSE 'Accessories'
    END AS category_raw,
    (RANDOM() * 5)::INT + 1 AS quantity,
    (RANDOM() * 500 + 10)::NUMERIC(10,2) AS price
FROM generate_series(1, 1000) AS gs;
```

Step 3 | Sanity Check
```sql
SELECT
    COUNT(*) AS total_rows,
    COUNT(DISTINCT raw_phone) AS distinct_raw_phones,
    COUNT(DISTINCT category_raw) AS distinct_categories
FROM transactions_text_demo;
```

In-Class Assignment
Objective
Use measurement-first SQL to understand how dirty text affects grouping and counts.
Task 1 | Phone Number Diagnostics
Write a query that shows:
- raw_phone
- LENGTH(raw_phone)
- position of '-'
- position of '('
- count of rows per pattern

Use:
- LENGTH()
- POSITION() or STRPOS()
- GROUP BY
Task 2 | Category Fragmentation
Write a query that returns:
category_raw- number of transactions per category
```sql
GROUP BY category_raw
ORDER BY COUNT(*) DESC;
```

Answer:
- how many logical categories exist?
- how many categories appear due to annotations?
Expected Insight
You should clearly see that:
- grouping on raw text produces fragmented dimensions
- phone numbers represent the same entity in multiple formats
Homework
Scenario
You are asked to produce reliable KPIs from transactions_text_demo.
Business questions:
- revenue by category
- number of unique customers
- average transaction value
Part 1 | Profiling
Write SQL queries to assess:
- phone number format diversity
- category fragmentation
- impact of dirty text on GROUP BY
Use only diagnostic functions.
Part 2 | Standardization Layer
Create a cleaned SELECT projection (no updates) that includes:
- standardized phone number (last 8 digits)
- cleaned category (no annotations, trimmed)
- revenue per transaction
Use:
- REGEXP_REPLACE()
- TRIM()
- SUBSTRING()
- CONCAT() where relevant
Part 3 | KPI Comparison
Compute and compare:
- revenue by raw category
- revenue by cleaned category
- unique customers (raw vs cleaned phone)
Use GROUP BY in all comparisons.
Part 4 | Analytical Explanation
Briefly explain:
- why KPIs changed
- which cleaning step had the biggest impact
- what assumptions you made
- what could silently break in production
Submission Rules
- Create string_functions.sql
- SQL only (no updates or deletes)
- all cleaning must be done in SELECT queries
- results must be reproducible
- explanations must be concise and analytical
- push it into GitHub
This assignment tests analytical thinking, not just SQL syntax.
Dirty text → broken GROUP BY → misleading KPIs.