String Functions
2026-04-01
Text (string) functions operate at the row level and are used to inspect, clean, standardize, and transform textual data.
They are data quality tools, not cosmetic helpers.
Open Docker Desktop, Run and open Pgadmin:
Real-world text data is rarely clean:
If left untreated:
GROUP BY fragments categoriesCOUNT(DISTINCT ...) overcountsCorrect analytical order:
We always start with measurement, not transformation.
We use a controlled dummy table with deliberately inconsistent phone formats.
DROP TABLE IF EXISTS customers_raw_text;
CREATE TABLE customers_raw_text (
customer_id INTEGER,
first_name TEXT,
last_name TEXT,
raw_phone TEXT,
category_raw TEXT,
birth_date DATE
);
INSERT INTO customers_raw_text (
customer_id,
first_name,
last_name,
raw_phone,
category_raw,
birth_date
) VALUES
(1, 'joHN', 'doE', ' 077600945 ', 'Accessories (Promo)', DATE '1994-03-12'),
(2, 'MARY', 'sMiTh', '077-600-045', 'Electronics (Old)', DATE '1988-11-05'),
(3, 'aLEx', 'johnSON', '(374)-77-600-945', 'Accessories', DATE '2001-07-23'),
(4, 'anna', 'VAN DYKE', '37477600945', 'Electronics (Promo)', DATE '1999-01-30'),
(5, NULL, 'brOwn', '77600945', 'Accessories (Test)', DATE '1994-03-12');Our target standardized phone number is: 77600945
LENGTH() returns the number of characters in a text value.
It is a diagnostic function, not a cleaning function.
Before cleaning, we ask:
Do all values have the same length?
If not, they cannot be directly comparable.
\[\downarrow\]
| raw_phone | phone_length |
|---|---|
' 077600945 ' |
13 |
'077600945' |
9 |
'77600945' |
8 |
'077-600-045' |
11 |
'(374)-77-600-945' |
15 |
'37477600945' |
11 |
Even without cleaning, patterns emerge:
Without modifying data, we detected:
This prevents blind transformations later.
Using LENGTH() early allows you to:
77600945 has a known expected lengthWhitespace issues are:
Yet they are often overlooked in analysis.
Whitespace has no business meaning.
This makes TRIM() a safe first transformation.
\[\downarrow\]
| raw_phone | length | trimmed_phone | trimmed_length |
|---|---|---|---|
' 077600945 ' |
13 | '077600945' |
9 |
'077600945' |
9 | '077600945' |
9 |
'77600945' |
8 | '77600945' |
8 |
'077-600-045' |
11 | '077-600-045' |
11 |
'(374)-77-600-945' |
15 | '(374)-77-600-945' |
15 |
'37477600945' |
11 | '37477600945' |
11 |
SQL also provides directional variants:
\[\downarrow\]
TRIM() is equivalent to applying both.
TRIM() helps us decide:
Measurement → transformation → re-measurement
LTRIM(raw_phone)RTRIM(raw_number)Text values often differ only by capitalization, even though they represent the same entity.
From an analytical perspective:
Examples of equivalent values treated as different:
john, John, JOHNvan dyke, Van Dyke, VAN DYKEIf left untreated:
GROUP BY fragments categoriesCOUNT(DISTINCT ...) overcountsLOWER() converts all characters to lowercase.
\[\downarrow\]
| first_name | first_name_lower |
|---|---|
john |
john |
ANNa |
anna |
mARy |
mary |
geORGe |
george |
ALEx |
alex |
lAuRA |
laura |
UPPER() converts all characters to uppercase.
\[\downarrow\]
| last_name | last_name_upper |
|---|---|
DOE |
DOE |
smith |
SMITH |
joHNson |
JOHNSON |
brown |
BROWN |
O'NEILL |
O'NEILL |
van dyke |
VAN DYKE |
INITCAP() converts text to title case: jOhn sMiTH \(\rightarrow\) John Smith
\[\downarrow\]
| first_name | first_name_clean |
|---|---|
john |
John |
ANNa |
Anna |
mARy |
Mary |
geORGe |
George |
ALEx |
Alex |
lAuRA |
Laura |
| Function | Best Used For |
|---|---|
LOWER() |
joins, grouping, deduplication |
UPPER() |
codes, abbreviations |
INITCAP() |
names, presentation |
never mix raw and normalized text in analysis
Consistency matters more than preference.
After handling whitespace and capitalization, the next common issue is structural noise inside text values.
Typical examples include:
077-600-945077600945.), slashes(\,/), or underscores:REPLACE() removes or substitutes known, explicit characters.
This makes it ideal when:
REPLACE() is not pattern-based > it is literal and predictable.
Phone numbers often contain hyphens as visual separators.
\[\downarrow\]
| raw_phone | phone_no_hyphen |
|---|---|
077-600-045 |
077600045 |
(374)-77-600-945 |
(374)77600945 |
077600945 |
077600945 |
You can apply REPLACE() multiple times to remove different characters.
\[\downarrow\]
| raw_phone | phone_clean_partial |
|---|---|
(374)-77-600-945 |
37477600945 |
077-600-045 |
077600045 |
077600945 |
077600945 |
This is progress, but still not fully standardized.
At this stage:
We still have:
This tells us REPLACE() alone is not sufficient.
Every REPLACE() encodes a business assumption.
Document these decisions!
Use REPLACE() when:
Avoid REPLACE() when:
REGEXP_REPLACE()REGEXP stands for “Regular Expression”
REGEXP_REPLACE() allows you to define rules, not characters.
From an analytical perspective:
text → input stringpattern → regex rulereplacement → substitutionflags → modifiers (g = global)Pattern: [^0-9]
\[\downarrow\]
| raw_phone | digits_only |
|---|---|
077600945 |
077600945 |
077600945 |
077600945 |
77600945 |
77600945 |
077-600-045 |
077600045 |
(374)-77-600-945 |
37477600945 |
37477600945 |
37477600945 |
() ContentPattern: \([^)]*\)
\[\downarrow\]
| category_raw | category_clean1 |
|---|---|
| Accessories (Promo) | Accessories |
| Electronics (Old) | Electronics |
| Accessories | Accessories |
| Electronics (Promo) | Electronics |
| Accessories (Test) | Accessories |
Whitespace Alert
Trailing spaces may remain
'Accessories ' ≠ 'Accessories'
\[\downarrow\]
| category_raw | category_clean2 |
|---|---|
| Accessories (Promo) | Accessories |
| Electronics (Old) | Electronics |
| Accessories | Accessories |
| Electronics (Promo) | Electronics |
| Accessories (Test) | Accessories |
Pattern: \s*\(.*?\)
\[\downarrow\]
| category_raw | category_clean3 |
|---|---|
| Accessories (Promo) | Accessories |
| Electronics (Old) | Electronics |
| Accessories | Accessories |
| Electronics (Promo) | Electronics |
| Accessories (Test) | Accessories |
| Approach | Result | Notes |
|---|---|---|
| Pattern 1 | Accessories |
Leaves spaces |
| Pattern 2 | Accessories |
Safe & explicit |
| Pattern 3 | Accessories |
Compact regex |
| Scenario | Prefer |
|---|---|
| Known characters | REPLACE() |
| Variable formats | REGEXP_REPLACE() |
| Validation & extraction | REGEXP_REPLACE() |
| Simplicity | REPLACE() |
Analytical Warning | Regex Is Powerful
Always validate with:
LENGTH()COUNT(DISTINCT ...)SUBSTRING() extracts a portion of text based on position or pattern.
It is used to isolate signal from compound fields.
\[\downarrow\]
| raw_phone | phone_core |
|---|---|
077600945 |
77600945 |
37477600945 |
77600945 |
77600945 |
77600945 |
\[\rightarrow\]
| category_raw | category_prefix |
|---|---|
| Accessories (Promo) | Accessories |
| Electronics (Old) | Electronics |
| Accessories | Accessories |
| Electronics (Promo) | Electronics |
| Accessories (Test) | Accessories |
PostgreSQL supports regex-based extraction:
\[\downtarrow\]
| raw_phone | first_digit_sequence |
|---|---|
077600945 |
077600945 |
077-600-045 |
077 |
(374)-77-600-945 |
374 |
37477600945 |
37477600945 |
77600945 |
77600945 |
| Use Case | Prefer |
|---|---|
| Fixed-length identifiers | Positional |
| Variable formats | Regex-based |
| Performance-critical logic | Positional |
| Unknown structure | Regex-based |
SUBSTRING() when formats are stableSUBSTRING() when formats varyLENGTH() and COUNT(DISTINCT ...)CONCAT() and ||After cleaning text with:
TRIM()LOWER(), UPPER(), INITCAP()SUBSTRING()we often need to construct new text values.
CONCAT() and || allow us to combine atomic fields into meaningful analytical dimensions.
From an analytical perspective:
Concatenation is feature engineering, not cosmetic formatting.
Key properties:
NULL as empty||PostgreSQL also supports string concatenation using:
Behavior differs when NULL values are present.
\[\downarrow\]
| customer_id | first_name | last_name | full_name |
|---|---|---|---|
| 1 | john | doe | john doe |
| 2 | MARY | SMITH | MARY SMITH |
| 3 | aLEx | johnson | aLEx johnson |
| 4 | anna | VAN DYKE | anna VAN DYKE |
| 5 | NULL | brown | brown |
first_name is NULLThis output is technically valid but analytically weak.
Recommended analytical pattern: first normalize case and then concatenate
\[\downarrow\]
| customer_id | first_name_clean | last_name_clean | full_name_clean |
|---|---|---|---|
| 1 | John | Doe | John Doe |
| 2 | Mary | Smith | Mary Smith |
| 3 | Alex | Johnson | Alex Johnson |
| 4 | Anna | Van Dyke | Anna Van Dyke |
| 5 | NULL | Brown | Brown |
||Same logic using the operator form:
CONCAT() treats NULL as empty|| propagates NULLThis distinction directly affects analytical results.
| Situation | Prefer |
|---|---|
| Possible NULL values | CONCAT() |
| Strict NULL propagation | || |
| Reporting and dashboards | CONCAT() |
| Validation logic | || |
NULL should behaveCOUNT(DISTINCT ...)POSITION() / STRPOS()Where does a given substring start?
This is a diagnostic and validation step, not a cleaning step.
Knowing the position of a substring allows you to:
0 if the substring is not foundPostgreSQL also supports:
POSITION()Before cleaning categories, we first detect annotations.
\[\downarrow\]
| category_raw | open_paren_pos |
|---|---|
| Accessories (Promo) | 13 |
| Electronics (Old) | 13 |
| Accessories | 0 |
| Electronics (Promo) | 13 |
| Accessories (Test) | 13 |
open_paren_pos > 0 → annotation existsopen_paren_pos = 0 → clean categoryThis tells us which rows require cleaning.
\[\downarrow\]
| raw_phone | hyphen_pos | paren_pos |
|---|---|---|
077600945 |
0 | 0 |
077-600-045 |
4 | 0 |
(374)-77-600-945 |
6 | 1 |
37477600945 |
0 | 0 |
77600945 |
0 | 0 |
Note
This check is usually done before applying REGEXP_REPLACE().
A common analytical pattern is converting positions into boolean flags.
\[\downarrow\]
| customer_id | category_raw | has_annotation |
|---|---|---|
| 1 | Accessories (Promo) | true |
| 2 | Electronics (Old) | true |
| 3 | Accessories | false |
| 4 | Electronics (Promo) | true |
| 5 | Accessories (Test) | true |
This enables:
Without altering the original data.
| Question | Prefer |
|---|---|
| Where is it? | POSITION() / STRPOS() |
| Extract it | SUBSTRING() |
| Validate format | POSITION() |
| Clean text | REGEXP_REPLACE() |
POSITION() to measure and detectSUBSTRING() to extractREGEXP_REPLACE() to cleanSPLIT_PART()SPLIT_PART() allows you to extract a specific segment from such strings
without using regular expressions.
It answers the question:
Which part of a delimited string do I need?
Used to extract a specific segment from delimited text.
Known structure → simple logic → high reliability.
| raw_phone | first_part |
|---|---|
077-600-045 |
077 |
(374)-77-600-945 |
(374) |
77600945 |
77600945 |
| raw_phone | second_part |
|---|---|
077-600-045 |
600 |
(374)-77-600-945 |
77 |
77600945 |
Use together with:
POSITION()NULLIF()NULLIF()Real-world data often encodes missing information as fake values.
NULLIF() lets you explicitly decide:
When should a value be treated as missing?
From an analytical perspective:
NULLIF() restores semantic correctness.
NULL if values are equal\[\downarrow\]
| customer_id | last_name | last_name_clean |
|---|---|---|
| 1 | doe | doe |
| 2 | smith | smith |
| 3 | johnson | johnson |
| 4 | VAN DYKE | VAN DYKE |
| 5 | NULL |
Without NULLIF():
COUNT(DISTINCT last_name) is wrong\[\downarrow\]
| category_raw | category_clean |
|---|---|
| Accessories | Accessories |
| UNKNOWN | NULL |
This prevents placeholder values from inflating KPIs.
Avoids malformed labels caused by empty strings.
| Goal | Function |
|---|---|
| Convert bad value to NULL | NULLIF() |
| Replace NULL with value | COALESCE() |
| Restore missing meaning | NULLIF() |
| Impute values | COALESCE() |
NULL firstLEFT() / RIGHT()Some analytical fields follow a fixed directional structure.
LEFT() and RIGHT() let you extract directional segments without complex logic.
From an analytical perspective:
SUBSTRING()They are often used for:
n = number of characters\[\downarrow\]
| raw_phone | phone_prefix |
|---|---|
077600945 |
077 |
077-600-045 |
077 |
(374)-77-600-945 |
374 |
37477600945 |
374 |
77600945 |
776 |
\[\downarrow\]
| raw_phone | phone_core |
|---|---|
077600945 |
77600945 |
37477600945 |
77600945 |
77600945 |
77600945 |
| Scenario | Prefer |
|---|---|
| Fixed prefix/suffix | LEFT()/RIGHT() |
| Dynamic positions | SUBSTRING() |
| Readability | LEFT()/RIGHT() |
| Complex extraction | SUBSTRING() |