Regex Manual

This manual provides comprehensive technical documentation for regular expression support in the Parasol Framework. It covers pattern syntax, behaviour, and all supported features for both Fluid and C++ developers.

For information on how to use the Regex class and its methods in your code, please refer to the Regex module API documentation.


Introduction

Parasol's regex support is based on the regular expression syntax defined in the ECMAScript Specification. The implementation provides full Unicode support with UTF-8 encoding enabled by default.

Key features include:

  • ECMAScript Compliance: Supports expressions defined in the latest ECMAScript specification draft
  • Unicode Support: Full UTF-8 Unicode support with property matching
  • Set Operations: Character class intersection, subtraction, and string sequences
  • Named Captures: Named and numbered capture groups with backreferences
  • Lookahead and Lookbehind: Zero-width assertions for complex matching
  • Flag Modifiers: Inline flag control for case sensitivity, multiline, and dotall modes

Regular expressions are compiled into pattern objects that can be reused for efficient matching operations.


Character Matching

Regular expressions match characters in the target text based on pattern specifications. The following table describes all character matching forms:

Pattern Description Example
. Matches any character except line terminators (U+000A, U+000D, U+2028, U+2029). With the dotall flag, matches every code point. a.c matches "abc", "aXc"
\0 Matches NULL character (U+0000) \0 matches null byte
\t Matches Horizontal Tab (U+0009) a\tb matches "a", a tab character, then "b"
\n Matches Line Feed (U+000A) a\nb matches "a\nb"
\v Matches Vertical Tab (U+000B) \v matches vertical tab
\f Matches Form Feed (U+000C) \f matches form feed
\r Matches Carriage Return (U+000D) \r\n matches Windows line ending
\cX Matches control character where X is A-Z or a-z. Value is (code point of X) & 0x1F \cA matches Ctrl-A (U+0001)
\\ Matches backslash character (U+005C) \\ matches "\"
\xHH Matches character with hexadecimal code HH (00-FF) \x41 matches "A"
\uHHHH Matches character with Unicode code point HHHH \u0041 matches "A"
\u{H...} Matches character with Unicode code point represented by hex digits (up to 10FFFF) \u{1F600} matches 😀
\X When X is one of ^ $ . * + ? ( ) [ ] { } | /, matches X literally \( matches "("
Character Any character not listed above matches itself abc matches "abc"

Line Terminators

Line terminator code points are: U+000A (Line Feed), U+000D (Carriage Return), U+2028 (Line Separator), and U+2029 (Paragraph Separator).

Escape Sequences

All escape sequences must be complete and valid. If \c is not followed by a letter A-Z or a-z, \x is not followed by two hexadecimal digits, \u is not followed by four hexadecimal digits, or \u{...} does not contain valid hexadecimal or exceeds U+10FFFF, an error_escape exception is thrown.
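
The \cX rule and the validity checks above can be cross-checked in any ECMAScript engine. The sketch below uses TypeScript's built-in RegExp in Unicode mode (the "u" flag) rather than the Parasol API, on the assumption that Parasol follows the ECMAScript rules it is based on. The helper names controlCodePoint and isValidPattern are illustrative only, and the standard engine raises a SyntaxError where this manual names error_escape.

```typescript
// Value matched by \cX: the code point of X masked with 0x1F.
function controlCodePoint(letter: string): number {
  return letter.charCodeAt(0) & 0x1f;
}

// True when the pattern compiles in Unicode mode, false on a syntax error.
function isValidPattern(pattern: string): boolean {
  try {
    new RegExp(pattern, "u");
    return true;
  } catch (err) {
    return false;
  }
}

const ctrlA = controlCodePoint("A");       // 1, i.e. U+0001
const ctrlLowerJ = controlCodePoint("j");  // 10, i.e. U+000A (Line Feed)
const matchesCtrlA = new RegExp("\\cA", "u").test("\u0001");

const incompleteHex = isValidPattern("\\x4");       // \x needs two hex digits
const incompleteUnicode = isValidPattern("\\u12");  // \u needs four hex digits
const outOfRange = isValidPattern("\\u{110000}");   // above U+10FFFF
const badControl = isValidPattern("\\c1");          // \c needs a letter
```

All four incomplete or out-of-range escapes are rejected at compile time, while \cA compiles and matches U+0001.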

Special Character Escaping

In character classes (see Character Classes), the hyphen - can also be escaped as \-. The character ] must always be escaped as \] to be matched literally.


Alternatives

The | operator matches one of multiple alternative patterns, evaluated from left to right:

A|B|C

This matches pattern A, or pattern B, or pattern C. The first alternative that succeeds is adopted; later alternatives are tried only if the remainder of the pattern then fails and the engine backtracks.

Example

local pattern = regex.new('abc|abcdef')
local match = pattern.match('abcdef')
-- match[1] = "abc" (not "abcdef")

Even though "abcdef" would match the second alternative completely, the pattern matches "abc" from the first alternative because alternatives are evaluated left to right.
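
The left-to-right rule can be cross-checked in a standard ECMAScript engine. This is a minimal sketch using TypeScript's built-in RegExp, not the Parasol API; firstMatchText is a hypothetical helper, and the sketch assumes Parasol orders alternatives the same way.

```typescript
// Returns the text of the first match, or null when nothing matches.
function firstMatchText(pattern: string, text: string): string | null {
  const m = new RegExp(pattern).exec(text);
  return m ? m[0] : null;
}

const shortFirst = firstMatchText("abc|abcdef", "abcdef");  // "abc"
const longFirst = firstMatchText("abcdef|abc", "abcdef");   // "abcdef"

// A later alternative is still tried when the rest of the pattern fails:
// "abc" cannot be followed by end-of-text here, so "abcdef" is used instead.
const backtracked = firstMatchText("(?:abc|abcdef)$", "abcdef");  // "abcdef"
```

When one alternative is a prefix of another, listing the longer one first is the usual way to prefer it.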

Multiple alternatives can be combined:

local pattern = regex.new('cat|dog|bird|fish')

Character Classes

Character classes define sets of characters that can match at a single position in the target text.

Basic Character Classes

A character class is enclosed in square brackets [...] and matches any single character from the set:

Pattern Description Example
[ABC] Matches any of A, B, or C [ABC] matches "A", "B", or "C"
[^DEF] Matches any character except D, E, or F (negated class) [^DEF] matches any character but "D", "E", "F"
[G^H] Matches G, ^, or H (^ not first, so literal) [G^H] matches "G", "^", or "H"
[I-K] Matches any character from I to K inclusive (range) [I-K] matches "I", "J", "K"
[-LM] Matches -, L, or M (leading hyphen is literal) [-LM] matches "-", "L", "M"
[N-P-R] Matches N, O, P, -, or R (trailing hyphen after range is literal) [N-P-R] matches "N", "O", "P", "-", "R"
[S\-U] Matches S, -, or U (escaped hyphen) [S\-U] matches "S", "-", "U"
[.*+?] Most special regex characters lose their special meaning in character classes (see Character Escaping in Character Classes for the exceptions) [.*+?] matches ".", "*", "+", "?"
[] Empty class matches no code points (always fails) [] never matches
[^] Complement of empty class matches any code point [^] matches any character including line terminators

Character Class Rules

  1. Negation: When ^ is the first character in [], the class is negated and matches any character NOT in the set.
  2. Closing Bracket: The ] character must always be escaped as \] to be included literally in a character class.
  3. Hyphen: The - character creates a range when between two characters. To match - literally, place it first, last, or escape it as \-.
  4. Special Characters: Most special regex characters (., *, +, etc.) lose their special meaning inside character classes.

Character Ranges

Ranges define a span of consecutive Unicode code points:

local pattern = regex.new('[A-Z]')     -- Matches any uppercase letter A-Z
local pattern = regex.new('[0-9]')     -- Matches any digit
local pattern = regex.new('[a-zA-Z]')  -- Matches any letter

If the range is invalid (e.g., [b-a] where the starting code point is greater than the ending code point), an error_range exception is thrown.

Case-Insensitive Matching

When case-insensitive matching is enabled (with the icase flag), character classes expand to include case-folded variations:

local pattern = regex.new('[E-F]', regex.ICASE)
-- Matches 'E', 'F', 'e', 'f', and any Unicode case variants

Note: Range [E-f] with icase flag will match all characters from U+0045 ('E') to U+0066 ('f'), including brackets, backslash, and other punctuation, plus their case-folded variants.
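
The effect of a range that spans both letter cases can be seen in any ECMAScript engine. The sketch below uses TypeScript's built-in RegExp with its "i" flag as a stand-in for icase; it is an illustration under the assumption that Parasol folds ranges the same way, not a test of the Parasol API.

```typescript
const plainRange = new RegExp("[E-f]");
const foldedRange = new RegExp("[E-f]", "i");

const underscoreInRange = plainRange.test("_");  // true: U+005F lies between 'E' and 'f'
const backslashInRange = plainRange.test("\\");  // true: U+005C lies in the range too
const upperDPlain = plainRange.test("D");        // false: U+0044 is below 'E'
const upperDFolded = foldedRange.test("D");      // true: 'd' is in the range and folds with 'D'
const lowerZFolded = foldedRange.test("z");      // true: 'Z' is in the range and folds with 'z'
```

Writing [E-Fe-f] instead avoids the punctuation between 'Z' and 'a'.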

Predefined Character Classes

Predefined character classes provide convenient shortcuts for common character sets:

Pattern Equivalent Description
\d [0-9] Matches any decimal digit
\D [^0-9] Matches any non-digit
\s [ \t\n\v\f\r\u00a0\u1680\u2000-\u200a\u2028-\u2029\u202f\u205f\u3000\ufeff] Matches any whitespace character (WhiteSpace + LineTerminator)
\S [^ \t\n\v\f\r\u00a0\u1680\u2000-\u200a\u2028-\u2029\u202f\u205f\u3000\ufeff] Matches any non-whitespace
\w [0-9A-Za-z_] Matches any word character (alphanumeric + underscore)
\W [^0-9A-Za-z_] Matches any non-word character
\p{...} (See Unicode Support) Matches characters with specified Unicode property
\P{...} (See Unicode Support) Matches characters without specified Unicode property

All predefined character classes can be used inside character classes:

local pattern = regex.new('[\\d!"#$%&\'\\(\\)]')  -- Matches digits or punctuation

Note: The \s whitespace class automatically expands when new code points are added to Unicode category Zs.

Set Operations

Character classes support advanced set operations for precise character matching. These operations are always available as standard features.

Intersection with &&

The intersection operator && matches characters that belong to both sets:

[A&&B]

Examples:

-- Match lowercase Latin letters only
local pattern = regex.new('[\\p{sc=Latin}&&\\p{Ll}]')
-- Matches: a, b, c, ..., z, ñ, ø, etc. (lowercase Latin)
-- Does NOT match: A, B, C, ... (not lowercase)

-- Match ASCII letters only (not extended Latin)
local pattern = regex.new('[\\p{sc=Latin}&&[A-Za-z]]')

Subtraction with --

The subtraction operator -- matches characters in the first set but not in the second:

[A--B]

Examples:

-- Match Latin letters that are NOT lowercase
local pattern = regex.new('[\\p{sc=Latin}--\\p{Ll}]')
-- Matches: A, B, C, ..., Z (uppercase, titlecase, etc.)
-- Does NOT match: a, b, c, ... (lowercase excluded)

-- Match letters except vowels
local pattern = regex.new('[[A-Za-z]--[AEIOUaeiou]]')
-- Matches: consonants only

String Sequences with \q{...}

The \q{...} syntax allows character classes to match multi-character sequences:

[a-z\q{ch|th|ph}]

This matches either:

  • Any single character from a-z, OR
  • The sequence "ch", OR
  • The sequence "th", OR
  • The sequence "ph"

Longest Match Priority: When strings are included in a character class, the longest matching string is always selected first:

local pattern = regex.new('[a-z\\q{ch|chocolate}]')
-- When matching "chocolate", matches the full word "chocolate"
-- Not "ch" followed by "ocolate"

The sequence [a-z\q{ch|th|ph}] is functionally equivalent to (?:ch|th|ph|[a-z]).

Examples:

-- Match common digraphs or single letters
local pattern = regex.new('[a-z\\q{ch|sh|th|ph}]+')

-- Match emoticon sequences or letters
local pattern = regex.new('[A-Z\\q{:\\-\\)|:\\-\\(|:\\-D}]')

String sequences can be used with all set operators (union, intersection, subtraction).

Nesting Character Classes

Character classes can be nested as operands for set operations:

-- Valid: nested classes with operators
local pattern = regex.new('[\\p{sc=Latin}--[a-z]]')

-- Valid: nested union and subtraction
local pattern = regex.new('[A[B--C]D]')

Operator Restriction: Only one type of operator can be used per level of nesting:

-- INVALID: mixing union and -- at the same level
[AB--CD]           -- Error: union (AB) then subtraction (--)

-- VALID: operators in different nesting levels
[[AB]--[CD]]       -- OK: separate nesting levels
[A[B--C]D]         -- OK: subtraction inside union

Multiple uses of the same operator are permitted:

-- Valid: multiple subtractions at same level
[\\p{sc=Latin}--\\p{Lu}--[a-z]]

Character Escaping in Character Classes

The following characters must be escaped with \ when used literally in character classes:

  • (, ), [, {, }, /, -, |
  • ] must always be escaped (even outside character classes)

-- Correct
local pattern = regex.new('[\\(\\)\\[\\]\\{\\}]')

-- Incorrect (throws error_noescape)
local pattern = regex.new('[(]')

Reserved Double Punctuators

The following 18 double-character sequences are reserved for future use and cannot appear in character classes:

!!  ##  $$  %%  **  ++  ,,  ..
::  ;;  <<  ==  >>  ??  @@  ^^
``  ~~

If any of these appear in a character class, an error_operator exception is thrown.


Quantifiers

Quantifiers specify how many times a pattern element must match. Each quantifier has a greedy and non-greedy form.

Quantifier  Non-Greedy  Matches    Description
*           *?          0 or more  Repeats the preceding element zero or more times
+           +?          1 or more  Repeats the preceding element one or more times
?           ??          0 or 1     Makes the preceding element optional
{n}         N/A         Exactly n  Repeats the preceding element exactly n times
{n,}        {n,}?       n or more  Repeats the preceding element at least n times
{n,m}       {n,m}?      n to m     Repeats the preceding element between n and m times (inclusive)

Greedy vs Non-Greedy

Greedy quantifiers (default) match as many characters as possible while still allowing the overall pattern to succeed:

local pattern = regex.new('a.*b')
local match = pattern.match('axxxbxxxb')
-- match[1] = "axxxbxxxb" (matches up to the last 'b')

Non-greedy quantifiers (with ? suffix) match as few characters as possible while still allowing the overall pattern to succeed:

local pattern = regex.new('a.*?b')
local match = pattern.match('axxxbxxxb')
-- match[1] = "axxxb" (stops at the first 'b')

Quantifier Rules

  1. Quantifiers must have a preceding expression to quantify. Using a quantifier without a preceding element (e.g., * at the start of a pattern) throws error_badrepeat.

  2. If a quantifier range is invalid (e.g., {3,2} where n > m), an error_badbrace exception is thrown.

  3. Mismatched { or } characters throw error_brace.
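
The three rules can be exercised in a standard ECMAScript engine, which rejects the same patterns at compile time. The sketch below uses TypeScript's built-in RegExp in Unicode mode; compilesInUnicodeMode is a hypothetical helper, and the engine raises a generic SyntaxError rather than the error_badrepeat, error_badbrace and error_brace codes named above.

```typescript
// True when the pattern compiles in Unicode mode, false on a syntax error.
function compilesInUnicodeMode(pattern: string): boolean {
  try {
    new RegExp(pattern, "u");
    return true;
  } catch (err) {
    return false;
  }
}

const leadingStar = compilesInUnicodeMode("*abc");      // false: nothing to repeat
const reversedRange = compilesInUnicodeMode("a{3,2}");  // false: n is greater than m
const unclosedBrace = compilesInUnicodeMode("a{3");     // false: mismatched brace
const validRange = compilesInUnicodeMode("a{2,3}");     // true
```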

Examples

-- Match one or more digits
local pattern = regex.new('\\d+')

-- Match optional sign followed by digits
local pattern = regex.new('[+-]?\\d+')

-- Match exactly 3 letters
local pattern = regex.new('[A-Za-z]{3}')

-- Match 2 to 4 word characters (greedy)
local pattern = regex.new('\\w{2,4}')

-- Match 2 to 4 word characters (non-greedy)
local pattern = regex.new('\\w{2,4}?')

-- Match at least 5 digits
local pattern = regex.new('\\d{5,}')

Quantifiers and Captured Groups

When a capturing group is quantified, the captured value is updated on each iteration. Only the last iteration's match is preserved:

local pattern = regex.new('(?:(a)|(b))+')
local match = pattern.match('ab')
-- match[1] = "ab" (full match)
-- match[2] = "" (empty, last iteration captured 'b', not 'a')
-- match[3] = "b" (last iteration captured 'b')
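
The same pattern can be run in a standard ECMAScript engine as a cross-check. This sketch uses TypeScript's built-in RegExp, where the full match sits at index 0 (not 1 as in the Fluid examples) and a group that did not participate is reported as undefined rather than an empty string. groupsOf is a hypothetical helper.

```typescript
// Returns the full match followed by each capture group, or an empty array.
function groupsOf(pattern: string, text: string): (string | undefined)[] {
  const m = new RegExp(pattern).exec(text);
  return m ? m.slice() : [];
}

const captures = groupsOf("(?:(a)|(b))+", "ab");
// captures[0] is "ab"      (full match)
// captures[1] is undefined (group 1 was reset by the last iteration)
// captures[2] is "b"       (group 2 matched in the last iteration)
```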

Grouping and Backreferences

Parentheses create groups for capturing matches and controlling operator precedence.

Capturing Groups

Capturing groups are created with (...) and are numbered starting from 1. In Fluid, the result table holds the full match at index 1, so group N is found at index N + 1:

local pattern = regex.new('(\\d{3})-(\\d{3})-(\\d{4})')
local match = pattern.match('555-123-4567')
-- match[1] = "555-123-4567" (full match, always at index 1)
-- match[2] = "555" (first capturing group)
-- match[3] = "123" (second capturing group)
-- match[4] = "4567" (third capturing group)

Group Numbering: Groups are numbered by the position of their opening ( parenthesis from left to right:

local pattern = regex.new('((a)(b))c')
-- Group 1: ((a)(b))
-- Group 2: (a)
-- Group 3: (b)

Non-Capturing Groups

Non-capturing groups (?:...) group expressions without creating a capture:

local pattern = regex.new('(?:tak(?:e|ing))')
-- Matches "take" or "taking" without capturing

Use non-capturing groups to:

  • Apply quantifiers to multiple characters: (?:ab)+
  • Group alternatives: (?:cat|dog)
  • Improve performance (slightly faster than capturing groups)

Named Capture Groups

Named groups associate a name with a captured substring:

(?<name>...)

Example:

local pattern = regex.new('(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})')
local match = pattern.match('2025-10-14')
-- match[1] = "2025-10-14"
-- match[2] = "2025" (group 1, also accessible as 'year')
-- match[3] = "10" (group 2, also accessible as 'month')
-- match[4] = "14" (group 3, also accessible as 'day')

Named groups are also assigned a number and can be accessed by both name and number.

Duplicate Named Groups

Named groups can be reused if they appear in different alternatives:

local pattern = regex.new('(?<year>\\d{4})-\\d{1,2}|\\d{1,2}-(?<year>\\d{4})')
-- Matches "2025-10" or "10-2025"
-- 'year' captures the 4-digit year from either position

This feature was introduced in ES2025.

Numeric Backreferences

A backreference \N (where N is a positive integer starting from 1) matches the same text that was captured by group N:

local pattern = regex.new('(TO|to)..\\1')
-- Matches "TOMATO" or "tomato" but not "Tomato"
-- \1 refers to captured text from group 1

Example:

local pattern = regex.new('(["\']).*?\\1')
-- Matches string in quotes: "hello" or 'hello'
-- But not mixed quotes: "hello'

Named Backreferences

A backreference \k<name> matches the text captured by a named group:

local pattern = regex.new('(?<quote>["\']).*?\\k<quote>')
-- Same as above, but using named group

Backreference Rules

  1. Forward References: Backreferences can appear before their corresponding group:

    local pattern = regex.new('\\1(abc)')  -- Valid in ECMAScript
  2. Undefined Matches: A backreference to a group that hasn't captured anything matches the empty string:

    local pattern = regex.new('(a)?b\\1')
    -- Matches "b" (group 1 didn't capture, so \1 matches empty string)
  3. Invalid Groups: If a backreference refers to a non-existent group number, an error_backref exception is thrown:

    local pattern = regex.new('\\5')  -- Error: no group 5 exists
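
These three rules match standard ECMAScript behaviour in Unicode mode, so they can be cross-checked outside Parasol. The sketch below uses TypeScript's built-in RegExp with the "u" flag; searchText is a hypothetical helper, and the engine raises a SyntaxError where this manual names error_backref.

```typescript
// Returns the text of the first match found anywhere in the input, or null.
function searchText(pattern: string, text: string): string | null {
  const m = new RegExp(pattern, "u").exec(text);
  return m ? m[0] : null;
}

const forwardRef = searchText("\\1(abc)", "abc");      // "abc": \1 is empty when first evaluated
const unsetGroupOnB = searchText("(a)?b\\1", "b");     // "b"
const unsetGroupOnAb = searchText("(a)?b\\1", "ab");   // "b", found at the second character
const repeatedGroup = searchText("(a)?b\\1", "aba");   // "aba": group 1 captured "a"

let missingGroupRejected = false;
try {
  new RegExp("\\5", "u");  // no group 5 exists
} catch (err) {
  missingGroupRejected = true;
}
```

Note the "ab" case: once group 1 has captured "a", \1 must match another "a", so the only match is the "b" on its own.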

Group Capture Clearing

When a capturing group is inside a quantified expression, captures are cleared on each iteration:

local pattern = regex.new('(?:(a)|(b))+')
local match = pattern.match('ab')
-- Only the last iteration's captures are retained
-- match[2] = "" (group 1's last iteration matched nothing)
-- match[3] = "b" (group 2's last iteration matched "b")

Flag Modifiers

Flag modifiers allow inline control of matching behaviour within specific parts of a pattern.

Bounded Flag Modifiers

Bounded flag modifiers enable or disable flags only within a specific group:

(?ims-ims:...)

Available Flags:

Flag Meaning
i Case-insensitive matching (icase)
m Multiline mode (^ and $ match line boundaries)
s Dotall mode (. matches line terminators)
-i Disable case-insensitive matching
-m Disable multiline mode
-s Disable dotall mode

Examples:

-- Case-insensitive only for middle section
local pattern = regex.new('hello(?i:world)THERE')
-- Matches: "helloworldTHERE", "helloWORLDTHERE", "helloWoRlDTHERE"
-- Does NOT match: "HELLOworldthere" (case-sensitive outside group)

-- Combine multiple flags
local pattern = regex.new('(?ims:.*)')
-- Case-insensitive + multiline + dotall for entire group

-- Disable flags
local pattern = regex.new('hello(?-i:world)', regex.ICASE)
-- "hello" is case-insensitive (via ICASE), "world" is case-sensitive

Flag Modifier Rules

  1. Single Use per Flag: Each flag letter can only appear once per modifier group:

    -- INVALID: 'i' appears twice
    (?ii:...)      -- Throws error_modifier
    (?i-i:...)     -- Throws error_modifier
  2. Scope: Flag modifiers affect only the expressions inside their group.

  3. ES2025 Feature: Bounded flag modifiers were introduced in ES2025 and are enabled by default.


Assertions

Assertions test conditions at the current position without consuming characters (zero-width).

Anchors

Assertion Description
^ Matches at the start of the string. With multiline flag, also matches immediately after line terminators.
$ Matches at the end of the string. With multiline flag, also matches immediately before line terminators.

Examples:

-- Match lines starting with "#"
local pattern = regex.new('^#.*', regex.MULTILINE)

-- Match lines ending with ";"
local pattern = regex.new('.*;$', regex.MULTILINE)

Word Boundaries

Assertion Description
\b Matches at a word boundary: a position with \w on one side and \W, or the start or end of the text, on the other
\B Matches at any position that is not a word boundary

Examples:

-- Match "cat" as a whole word
local pattern = regex.new('\\bcat\\b')
-- Matches: "cat in hat"
-- Does NOT match: "concatenate"

-- Match "cat" not as a whole word
local pattern = regex.new('\\Bcat\\B')
-- Matches: "concatenate"
-- Does NOT match: "cat in hat"

Note: Inside a character class [...], \b matches the backspace character (U+0008), not a word boundary. Using \B inside a character class throws error_escape.

Lookahead Assertions

Lookahead assertions check if a pattern matches ahead without consuming characters:

Assertion Description
(?=...) Positive lookahead: succeeds if pattern matches ahead
(?!...) Negative lookahead: succeeds if pattern does NOT match ahead

Examples:

-- Match "a" only if followed by "bc" or "def"
local pattern = regex.new('a(?=bc|def)')
-- Matches: the "a" in "abc" and in "adef"
-- Does NOT match: "axyz"

-- Match "a" only if NOT followed by "bc" or "def"
local pattern = regex.new('a(?!bc|def)')
-- Matches: the "a" in "axyz"
-- Does NOT match: "abc", "adef"

-- Find & symbols that are not HTML entities
local pattern = regex.new('&(?!amp;|lt;|gt;|#)')
-- Matches bare "&" but not "&amp;", "&lt;", etc.
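
The entity pattern above is typically used in a replace operation. As a minimal sketch, the following applies it with TypeScript's built-in RegExp (a standard ECMAScript engine, used here instead of the Parasol API); escapeBareAmpersands is a hypothetical helper name.

```typescript
// Escapes "&" unless it already begins one of the entities named in the pattern.
function escapeBareAmpersands(text: string): string {
  return text.replace(new RegExp("&(?!amp;|lt;|gt;|#)", "g"), "&amp;");
}

const escapedText = escapeBareAmpersands("Tom & Jerry &lt;3 &#38; R&D");
// "Tom &amp; Jerry &lt;3 &#38; R&amp;D"
```

Because the lookahead consumes nothing, running the function a second time leaves the text unchanged.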

Lookbehind Assertions

Lookbehind assertions check if a pattern matches behind without consuming characters:

Assertion Description
(?<=...) Positive lookbehind: succeeds if pattern matches behind
(?<!...) Negative lookbehind: succeeds if pattern does NOT match behind

Examples:

-- Match "a" only if preceded by "bc" or "de"
local pattern = regex.new('(?<=bc|de)a')
-- Matches: the "a" in "bca" and in "dea"
-- Does NOT match: "xa"

-- Match "a" only if NOT preceded by "bc" or "de"
local pattern = regex.new('(?<!bc|de)a')
-- Matches: the "a" in "xa"
-- Does NOT match: "bca", "dea"

Assertion Combinations

Assertions can be combined for complex matching:

-- Match words of 3 to 6 word characters containing at least one vowel
local pattern = regex.new('\\b(?=\\w*[aeiou])\\w{3,6}\\b', regex.ICASE)

-- Match integer strings that are not part of larger numbers
local pattern = regex.new('(?<!\\d)\\d+(?!\\d)')
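
Both combined patterns can be tried in a standard ECMAScript engine. The sketch below uses TypeScript's built-in RegExp with the "g" flag to collect every match; allMatches is a hypothetical helper and the sample sentences are invented for illustration.

```typescript
// Collects every match of the pattern in the text (the flags must include "g").
function allMatches(pattern: string, flags: string, text: string): string[] {
  const found = text.match(new RegExp(pattern, flags));
  return found ? found.slice() : [];
}

// Words of 3 to 6 word characters that contain a vowel.
const vowelWords = allMatches(
  "\\b(?=\\w*[aeiou])\\w{3,6}\\b", "gi", "Try the rhythm of a sky banana");
// "the" and "banana"; "Try", "rhythm" and "sky" have no vowel, "of" and "a" are too short

// Digit runs with no digit immediately before or after them.
const digitRuns = allMatches("(?<!\\d)\\d+(?!\\d)", "g", "v2 costs 1500 or 99");
// "2", "1500" and "99"
```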

Unicode Support

Parasol's regex implementation provides full Unicode support with UTF-8 encoding enabled by default.

Unicode Properties

Unicode properties match characters based on their Unicode characteristics using \p{...} and \P{...}:

Pattern Description
\p{Property} Matches characters with the specified Unicode property
\P{Property} Matches characters without the specified Unicode property

Common Unicode Properties

Script Properties

Match characters from specific writing systems:

-- Match Latin characters
local pattern = regex.new('\\p{sc=Latin}+')

-- Match Greek characters
local pattern = regex.new('\\p{Script=Greek}+')

-- Match characters whose Script_Extensions include Latin
local pattern = regex.new('\\p{scx=Latin}+')

Common scripts: Latin, Greek, Cyrillic, Han, Arabic, Hebrew, Hiragana, Katakana, etc.

General Categories

Match characters by their general category:

Property  Description                    Examples
\p{Lu}    Uppercase letter               A, B, Z, À, Ω
\p{Ll}    Lowercase letter               a, b, z, à, ω
\p{Lt}    Titlecase letter               Dž, Lj, Nj
\p{L}     Any letter (Lu|Ll|Lt|Lm|Lo)    All letters
\p{Nd}    Decimal number                 0-9, ০-৯
\p{N}     Any number (Nd|Nl|No)          All numbers
\p{P}     Punctuation                    ., !, ?, ;
\p{S}     Symbol                         $, +, =, ©
\p{Z}     Separator                      Space, non-breaking space
\p{C}     Other (control, format, etc.)  Control characters

Examples:

-- Match any letter in any script
local pattern = regex.new('\\p{L}+')

-- Match digits in any script
local pattern = regex.new('\\p{Nd}+')

-- Match all punctuation
local pattern = regex.new('\\p{P}+')

Binary Properties

Binary properties have true/false values:

-- Match whitespace characters
local pattern = regex.new('\\p{White_Space}+')

-- Match emoji
local pattern = regex.new('\\p{Emoji}')

-- Match characters used in identifiers
local pattern = regex.new('\\p{ID_Start}\\p{ID_Continue}*')

Unicode Property Syntax

Properties can be specified in several formats:

-- Short form
\\p{Lu}              -- Uppercase letter
\\p{sc=Latin}        -- Latin script

-- Long form
\\p{Script=Latin}
\\p{General_Category=Uppercase_Letter}

-- Binary properties
\\p{Emoji}
\\p{White_Space}

For a complete list of available properties, see the ECMAScript Unicode Property Table.
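
The short and long forms are interchangeable, which can be confirmed in a standard ECMAScript engine. The sketch below uses TypeScript's built-in RegExp in Unicode mode (the "u" flag is needed there for \p{...}; Parasol enables Unicode by default). matchesWhole is a hypothetical helper.

```typescript
// True when the whole text matches the pattern in Unicode mode.
function matchesWhole(pattern: string, text: string): boolean {
  return new RegExp("^(?:" + pattern + ")$", "u").test(text);
}

const omega = "\u03A9";  // GREEK CAPITAL LETTER OMEGA

const shortCategory = matchesWhole("\\p{Lu}", omega);
const longCategory = matchesWhole("\\p{General_Category=Uppercase_Letter}", omega);
const shortScript = matchesWhole("\\p{sc=Greek}", omega);
const longScript = matchesWhole("\\p{Script=Greek}", omega);

const bengaliDigits = matchesWhole("\\p{Nd}+", "\u09E6\u09E7\u09E8");  // Bengali 0, 1, 2
const notUppercase = matchesWhole("\\P{Lu}", "\u03C9");                // small omega
```

All six results are true.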

String Properties

Some Unicode properties match sequences of multiple characters (string properties). These can be used in character classes except negated classes:

-- Valid: string property in positive class
local pattern = regex.new('[\\p{RGI_Emoji}]')

-- INVALID: string property with negation
local pattern = regex.new('[^\\p{RGI_Emoji}]')  -- Throws error_complement

-- INVALID: string property with \P{...}
local pattern = regex.new('\\P{RGI_Emoji}')     -- Throws error_complement

Unicode Case Folding

When case-insensitive matching is enabled with the icase flag, Unicode case folding rules apply:

local pattern = regex.new('café', regex.ICASE)
-- Matches: "café", "CAFÉ", "Café", "cAfÉ", etc.

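Case folding follows the simple (one-to-one) Unicode case folding rules that ECMAScript specifies, so it covers more characters than ASCII uppercasing/lowercasing but never maps one character to several:

local pattern = regex.new('ß', regex.ICASE)
-- Matches: "ß" and "ẞ" (U+1E9E, capital sharp S); does not match "SS", which would require full case folding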

Unicode Code Point Ranges

Character classes operate on Unicode code points:

-- Match all characters in Basic Multilingual Plane
local pattern = regex.new('[\\u0000-\\uFFFF]+')

-- Match emoji range (partial)
local pattern = regex.new('[\\u{1F600}-\\u{1F64F}]+')

Invalid UTF-8 Handling

The regex engine validates UTF-8 sequences:

  1. Trailing bytes must be in range 0x80-0xBF. Invalid trailing bytes cause matching to fail at that position.

  2. Code points must be ≤ 0x10FFFF. Values exceeding this cause matching to fail.

  3. Non-shortest forms are rejected. For example, U+0030 (digit '0') must be encoded as 0x30, not as the longer forms 0xC0 0xB0 or 0xE0 0x80 0xB0.

At pattern compile time, invalid UTF-8 throws error_utf8. At matching time, invalid UTF-8 leads to match failure at that position.
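
The three validation rules can be sketched as a small decoder. This is an illustration of the rules as stated above, not Parasol's implementation; decodeOne and its return shape are hypothetical, and the sketch checks only what the list mentions.

```typescript
// Decodes one UTF-8 sequence starting at `pos`.
// Returns null when the sequence breaks one of the three rules.
function decodeOne(
  bytes: number[], pos: number
): { codePoint: number; size: number } | null {
  const lead = bytes[pos];
  if (lead === undefined) return null;
  if (lead < 0x80) return { codePoint: lead, size: 1 };

  let size: number;
  let smallest: number;  // smallest code point that needs this many bytes
  let cp: number;
  if (lead >= 0xc0 && lead <= 0xdf) { size = 2; smallest = 0x80; cp = lead & 0x1f; }
  else if (lead >= 0xe0 && lead <= 0xef) { size = 3; smallest = 0x800; cp = lead & 0x0f; }
  else if (lead >= 0xf0 && lead <= 0xf7) { size = 4; smallest = 0x10000; cp = lead & 0x07; }
  else return null;  // stray trailing byte or invalid lead byte

  for (let i = 1; i < size; i++) {
    const b = bytes[pos + i];
    if (b === undefined || b < 0x80 || b > 0xbf) return null;  // rule 1
    cp = (cp << 6) | (b & 0x3f);
  }
  if (cp > 0x10ffff) return null;  // rule 2
  if (cp < smallest) return null;  // rule 3: non-shortest form
  return { codePoint: cp, size: size };
}

const plainZero = decodeOne([0x30], 0);                    // U+0030
const overlongTwo = decodeOne([0xc0, 0xb0], 0);            // null
const overlongThree = decodeOne([0xe0, 0x80, 0xb0], 0);    // null
const badTrailing = decodeOne([0xc3, 0x41], 0);            // null
const tooLarge = decodeOne([0xf4, 0x90, 0x80, 0x80], 0);   // null (would be U+110000)
const emoji = decodeOne([0xf0, 0x9f, 0x98, 0x80], 0);      // U+1F600
```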


Compilation Flags

Compilation flags affect how a regex pattern is compiled and interpreted. These flags are specified when creating a regex object.

Flag Effect
ICASE Case-insensitive matching. Matches characters regardless of case using Unicode case-folding rules.
MULTILINE Multiline mode. The ^ and $ anchors match at line boundaries (after/before line terminators) in addition to string boundaries.
DOTALL Dotall (singleline) mode. The . metacharacter matches line terminators (U+000A, U+000D, U+2028, U+2029) in addition to all other characters.

Flag Usage

The exact syntax for specifying flags depends on the language binding:

Fluid:

local pattern = regex.new('hello', regex.ICASE)
local pattern = regex.new('.*', regex.DOTALL)
local pattern = regex.new('^line', regex.MULTILINE + regex.ICASE)

C++:

auto pattern = pf::regex("hello", pf::regex::ICASE);
auto pattern = pf::regex(".*", pf::regex::DOTALL);
auto pattern = pf::regex("^line", pf::regex::MULTILINE | pf::regex::ICASE);

Flag Effects

ICASE (Case-Insensitive)

Makes pattern matching case-insensitive using Unicode case-folding:

local pattern = regex.new('hello', regex.ICASE)
-- Matches: "hello", "HELLO", "Hello", "HeLLo", etc.

local pattern = regex.new('[a-z]+', regex.ICASE)
-- Matches: "abc", "ABC", "aBc", etc.

MULTILINE

Changes behaviour of ^ and $ anchors to match line boundaries:

local pattern = regex.new('^\\w+', regex.MULTILINE)
-- Without MULTILINE: matches word at start of string only
-- With MULTILINE: matches word at start of string AND after each line terminator

local text = "first line\nsecond line\nthird line"
local pattern = regex.new('^\\w+', regex.MULTILINE)
-- Matches: "first", "second", "third"

DOTALL

Makes . match line terminators in addition to all other characters:

local pattern = regex.new('.*', regex.DOTALL)
-- Without DOTALL: .* matches up to (but not including) line terminators
-- With DOTALL: .* matches everything including line terminators

local text = "line 1\nline 2\nline 3"
local pattern = regex.new('.*', regex.DOTALL)
local match = pattern.match(text)
-- match[1] = "line 1\nline 2\nline 3" (entire string)

Note: When DOTALL is set, .* will match all remaining characters in the subject string.
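
The three compilation flags correspond to the standard ECMAScript flags "i", "m" and "s", so their effects can be cross-checked outside Parasol. The sketch below uses TypeScript's built-in RegExp and assumes Parasol's flags behave like their ECMAScript counterparts; the variable names are illustrative.

```typescript
const sampleText = "first line\nsecond line\nthird line";

// "m" corresponds to MULTILINE: ^ also matches after each line terminator.
const lineStarts = sampleText.match(new RegExp("^\\w+", "gm")) || [];
// Without it, ^ matches only at the start of the text.
const textStart = sampleText.match(new RegExp("^\\w+", "g")) || [];

// "s" corresponds to DOTALL: . also matches line terminators.
const withDotAll = new RegExp(".*", "s").exec(sampleText);
const withoutDotAll = new RegExp(".*").exec(sampleText);
const dotAllLength = withDotAll ? withDotAll[0].length : -1;     // whole text
const firstLineOnly = withoutDotAll ? withoutDotAll[0] : null;   // "first line"

// "i" corresponds to ICASE.
const caseFolded = new RegExp("^[a-z]+$", "i").test("aBc");      // true
```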


Match Flags

Match flags modify the behaviour of matching operations at runtime, after a pattern has been compiled. These flags are passed to matching functions (test, match, search, replace, split).

Flag Effect
NOT_BEGIN_OF_LINE Do not treat the beginning of the text as the start of a line (affects ^ in multiline mode)
NOT_END_OF_LINE Do not treat the end of the text as the end of a line (affects $ in multiline mode)
NOT_BEGIN_OF_WORD Do not treat the beginning of the text as the start of a word (affects \b)
NOT_END_OF_WORD Do not treat the end of the text as the end of a word (affects \b)
NOT_NULL Do not match empty sequences
CONTINUOUS Only match at the beginning of the text (anchored search)
PREV_AVAILABLE Indicates that the previous character position is available for lookbehind assertions
REPLACE_NO_COPY In replace operations, do not copy non-matching parts of the text
REPLACE_FIRST_ONLY In replace operations, replace only the first occurrence

Match Flag Usage

Fluid:

local pattern = regex.new('\\w+')

-- Replace only first occurrence
local result = pattern.replace('hello world', 'goodbye', regex.REPLACE_FIRST_ONLY)
-- result = "goodbye world"

-- Match only at beginning
local match = pattern.match('hello world', regex.CONTINUOUS)
-- Succeeds (starts at beginning)

local match = pattern.match('  hello', regex.CONTINUOUS)
-- Fails (does not start at beginning)

Flag Details

NOT_BEGIN_OF_LINE / NOT_END_OF_LINE

Useful when matching in the middle of a larger text:

local pattern = regex.new('^hello', regex.MULTILINE)

-- Normal matching
pattern.test('hello')  -- true (at beginning)

-- With NOT_BEGIN_OF_LINE
pattern.test('hello', regex.NOT_BEGIN_OF_LINE)  -- false (not treated as line start)

NOT_NULL

Prevents matching empty strings:

local pattern = regex.new('a*')

-- Normal: matches empty string
pattern.test('')  -- true

-- With NOT_NULL: rejects empty match
pattern.test('', regex.NOT_NULL)  -- false

CONTINUOUS

Forces match to start at the beginning of the text:

local pattern = regex.new('\\d+')

-- Normal: finds "123" anywhere
pattern.match('  123')  -- Matches "123"

-- With CONTINUOUS: must start at position 0
pattern.match('  123', regex.CONTINUOUS)  -- Fails
pattern.match('123', regex.CONTINUOUS)     -- Succeeds
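
Standard ECMAScript has no match-time flags, but its sticky flag "y" gives a comparable anchored search and can serve as a rough cross-check. The sketch below uses TypeScript's built-in RegExp; treating "y" as an analogue of CONTINUOUS is an assumption made for illustration, not a statement about how Parasol implements the flag.

```typescript
const anywhere = new RegExp("\\d+");
const anchoredAtStart = new RegExp("\\d+", "y");  // must match at lastIndex (0 here)

const foundAnywhere = anywhere.test("  123");             // true: search skips the spaces
const stickyWithPadding = anchoredAtStart.test("  123");  // false: no digit at index 0
anchoredAtStart.lastIndex = 0;                            // explicit reset before reuse
const stickyAtStart = anchoredAtStart.test("123");        // true
```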

REPLACE_NO_COPY

Affects replace operations by excluding non-matching text:

local pattern = regex.new('\\d+')

-- Normal replace: keeps non-matching text
pattern.replace('a123b456c', 'X')  -- "aXbXc"

-- With REPLACE_NO_COPY: only includes replacements
pattern.replace('a123b456c', 'X', regex.REPLACE_NO_COPY)  -- "XX"

REPLACE_FIRST_ONLY

Limits replacement to the first match:

local pattern = regex.new('\\d+')

-- Normal replace: replaces all
pattern.replace('123 456 789', 'X')  -- "X X X"

-- With REPLACE_FIRST_ONLY: replaces only first
pattern.replace('123 456 789', 'X', regex.REPLACE_FIRST_ONLY)  -- "X 456 789"
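
For comparison, the same three outcomes can be reproduced in a standard ECMAScript engine, where replacing every match requires the "g" flag and there is no direct counterpart to REPLACE_NO_COPY. The sketch below uses TypeScript's built-in RegExp and builds the no-copy result by hand; it is an illustration of the expected results, not of the Parasol API.

```typescript
const digitRun = "\\d+";

const replacedAll = "123 456 789".replace(new RegExp(digitRun, "g"), "X");  // "X X X"
const replacedFirst = "123 456 789".replace(new RegExp(digitRun), "X");     // "X 456 789"

// REPLACE_NO_COPY analogue: emit one replacement per match and drop the text between.
const matchCount = ("a123b456c".match(new RegExp(digitRun, "g")) || []).length;
let replacementsOnly = "";
for (let i = 0; i < matchCount; i++) {
  replacementsOnly += "X";
}
// replacementsOnly is "XX"
```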

Regular Expression Features

Parasol's regex implementation is based on the ECMAScript specification and provides the following characteristics:

ECMAScript Compliance

The implementation supports expressions defined in the ECMAScript Specification (latest draft), including:

  • ECMAScript 2018 (ES9): Named capture groups, lookbehind assertions, Unicode property escapes
  • ECMAScript 2025: Duplicate named capture groups, bounded flag modifiers
  • ECMAScript 2024: Set operations for character classes (intersection, subtraction, string sequences)

Differences from Other Engines

vs. Perl / PCRE

  • No \Q...\E literal sequences: Use explicit escaping instead
  • No possessive quantifiers or atomic groups: Emulate them with a lookahead and backreference (see Atomic Groups below)
  • No recursive patterns: Not supported in ECMAScript
  • No conditional patterns: Use alternation with lookahead instead
  • Different Unicode categories: Follow ECMAScript Unicode property names

vs. .NET Regex

  • No balanced groups: Named captures cannot be reused except in alternatives
  • No inline comments: (?#...) is not supported
  • Different flag syntax: Uses ECMAScript (?ims:...) instead of (?imnsx-imnsx:...)

vs. POSIX

  • No POSIX character classes: Use Unicode properties instead (e.g., \p{Alpha} instead of [[:alpha:]])
  • No collating sequences: [.ch.] not supported
  • No equivalence classes: [=e=] not supported

Notable Behaviours

Forward Backreferences

Backreferences can appear before their corresponding groups:

local pattern = regex.new('\\1(abc)')  -- Valid

This is valid in ECMAScript but may fail or behave differently in other engines.

Undefined Group Matching

Backreferences to groups that haven't captured anything match the empty string:

local pattern = regex.new('(a)?b\\1')
-- Matches the "b" in "ab" (at that position group 1 captured nothing, so \1 matches the empty string)

No Octal Escapes

The ECMAScript specification does not define octal escape sequences like \ooo or \0ooo (except \0 for NULL):

-- Valid
local pattern = regex.new('\\0')     -- Matches NULL (U+0000)

-- Invalid (not defined by ECMAScript)
local pattern = regex.new('\\101')   -- Error: invalid escape

Use hexadecimal or Unicode escapes instead:

local pattern = regex.new('\\x41')    -- 'A' in hexadecimal
local pattern = regex.new('\\u0041')  -- 'A' in Unicode

Substituting Advanced Features

Some operations not directly supported can be achieved through alternative patterns:

Intersection (Alternative Method):

-- Direct: [\p{sc=Latin}&&\p{Ll}]
-- Alternative: using lookahead
(?=\p{sc=Latin})\p{Ll}

Subtraction (Alternative Method):

-- Direct: [\p{sc=Latin}--\p{Ll}]
-- Alternative: using negative lookahead
(?!\p{Ll})\p{sc=Latin}
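
The lookahead alternatives for intersection and subtraction can be tried in any engine that supports Unicode property escapes. The sketch below uses TypeScript's built-in RegExp with the "u" flag as a stand-in for the Parasol API; the sample text and variable names are assumptions chosen for illustration.

```typescript
// Stand-in check with the built-in ECMAScript RegExp (Unicode mode).
// Sample: Latin lowercase 'a', Latin uppercase 'B', Greek lowercase gamma, digit '1'.
const scriptSample = "aB" + "\u03B3" + "1";

// Intersection via lookahead: Latin script AND lowercase letter.
const latinAndLower = new RegExp("(?=\\p{sc=Latin})\\p{Ll}", "gu");
const intersectionHits: string[] = scriptSample.match(latinAndLower) || [];
console.log(intersectionHits);   // ["a"]

// Subtraction via negative lookahead: Latin script but NOT lowercase letter.
const latinNotLower = new RegExp("(?!\\p{Ll})\\p{sc=Latin}", "gu");
const subtractionHits: string[] = scriptSample.match(latinNotLower) || [];
console.log(subtractionHits);    // ["B"]
```

The Greek gamma is lowercase but not Latin, and the digit belongs to neither set, so both are excluded in each case.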

Atomic Groups:

-- Perl/PCRE: (?>pattern)
-- ECMAScript equivalent: (?=(pattern))\1
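
The effect of the atomic group emulation is easiest to see with a pattern that only succeeds if the quantifier gives characters back. This is a minimal sketch using TypeScript's built-in RegExp as a stand-in (not the Parasol API); variable names are illustrative.

```typescript
// Stand-in check with the built-in ECMAScript RegExp.

// Ordinary greedy quantifier: a+ first takes "aaa", then backtracks to "aa"
// so that the trailing "ab" can match.
const backtracking = new RegExp("a+ab");
console.log(backtracking.test("aaab"));   // true

// Atomic emulation: the lookahead captures all the a's, and \1 consumes exactly
// that text. The engine cannot re-enter the lookahead to shorten the capture,
// so no 'a' is left over for the literal "ab".
const atomicLike = new RegExp("(?=(a+))\\1ab");
console.log(atomicLike.test("aaab"));     // false

// The emulation still matches when no give-back is needed.
const atomicThenB = new RegExp("(?=(a+))\\1b").exec("aaab");
console.log(atomicThenB !== null ? atomicThenB[0] : null);   // "aaab"
```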

Performance Considerations

Compile Once, Use Many Times

Regex patterns should be compiled once and reused:

Inefficient:

for i = 1, 10000 do
   local pattern = regex.new('\\d+')  -- Compiles pattern 10,000 times
   pattern.test(data[i])
end

Efficient:

local pattern = regex.new('\\d+')  -- Compiles pattern once
for i = 1, 10000 do
   pattern.test(data[i])  -- Reuses compiled pattern
end

Store Patterns in Variables

Store frequently used patterns in variables (local or global) rather than recreating them:

-- Compiled patterns
local emailPattern = regex.new('[\\w._%+-]+@[\\w.-]+\\.[A-Za-z]{2,}')
local phonePattern = regex.new('\\d{3}-\\d{3}-\\d{4}')
local datePattern = regex.new('\\d{4}-\\d{2}-\\d{2}')

-- Use patterns multiple times efficiently
for _, contact in ipairs(contacts) do
   if emailPattern.test(contact.email) then
      processEmail(contact)
   end
   if phonePattern.test(contact.phone) then
      processPhone(contact)
   end
end

Greedy vs Non-Greedy Quantifiers

Greedy and non-greedy quantifiers select different text, and the choice also affects how much work the matcher does:

-- Greedy: consumes as much as possible, then backtracks to the last '>'
local pattern = regex.new('<.*>')
-- Matches: "<tag>content</tag>" as one match

-- Non-greedy: consumes as little as possible, extending one character at a time
local pattern = regex.new('<.*?>')
-- Matches: "<tag>" and "</tag>" separately

When the closing delimiter is close to the opening one, as in tag matching, the non-greedy form usually needs fewer steps:

-- Extract tag content efficiently
local pattern = regex.new('<([^>]+)>(.*?)</\\1>')
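
The difference in matched text can be confirmed in any ECMAScript engine. The sketch below uses TypeScript's built-in RegExp as a stand-in for the Parasol API; the negated-class variant is included as an additional comparison and is not taken from the examples above.

```typescript
// Stand-in check with the built-in ECMAScript RegExp.
const markup = "<tag>content</tag>";

const greedyHits: string[] = markup.match(new RegExp("<.*>", "g")) || [];
console.log(greedyHits);    // ["<tag>content</tag>"]

const lazyHits: string[] = markup.match(new RegExp("<.*?>", "g")) || [];
console.log(lazyHits);      // ["<tag>", "</tag>"]

// A negated class selects the same tags without relying on laziness.
const negatedHits: string[] = markup.match(new RegExp("<[^>]*>", "g")) || [];
console.log(negatedHits);   // ["<tag>", "</tag>"]

// Tag name and content via a backreference (0-based: index 0 is the full match).
const element = new RegExp("<([^>]+)>(.*?)</\\1>").exec(markup);
if (element !== null) {
  console.log(element[1]);  // "tag"
  console.log(element[2]);  // "content"
}
```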

Avoid Catastrophic Backtracking

Certain patterns can cause exponential time complexity:

Dangerous Pattern:

-- Exponential backtracking on non-match
local pattern = regex.new('(a+)+b')
local text = 'aaaaaaaaaaaaaaaaaac'  -- No 'b' at end
-- Matching time grows exponentially with the number of 'a' characters in the input

Solutions:

  1. Use possessive-like behaviour:

    -- Prevent backtracking with atomic group simulation
    local pattern = regex.new('(?=(a+))\\1+b')
  2. Use negated character classes:

    -- Clearer intent, better performance
    local pattern = regex.new('[^b]+b')
  3. Be specific about what you're matching:

    -- Instead of: .*
    -- Use: [^<]+ (if not matching '<')
    -- Use: \\w+ (if matching word characters)
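
When rewriting a risky pattern it is worth checking that the replacement accepts and rejects the same inputs. The sketch below does this for the dangerous pattern and two rewrites, using TypeScript's built-in RegExp as a stand-in for the Parasol API. The flattened form a+b is an added comparison, not one of the solutions listed above, and sameVerdict is a hypothetical helper. Timing is deliberately not measured, since it depends on the engine.

```typescript
// Stand-in check with the built-in ECMAScript RegExp.
const nestedRisky = new RegExp("(a+)+b");          // nested quantifier (risky)
const atomicStyle = new RegExp("(?=(a+))\\1+b");   // atomic group emulation
const flattened = new RegExp("a+b");               // nested quantifier removed

// Returns true when all three patterns agree on whether the text matches.
function sameVerdict(text: string): boolean {
  const expected = nestedRisky.test(text);
  return atomicStyle.test(text) === expected && flattened.test(text) === expected;
}

// The failing input is kept short (18 a's) so that the risky pattern still
// finishes quickly; longer runs of 'a' make it dramatically slower.
const probes = ["aaab", "b", "xxaab", "aaaaaaaaaaaaaaaaaac"];
probes.forEach(function (text) {
  console.log(text, nestedRisky.test(text), sameVerdict(text));
});
```

All three patterns match "aaab" and "xxaab" and reject "b" and the input ending in "c". Only the first does so at exponential cost on the failing input.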

Character Class Optimisations

Use predefined classes when possible:

-- Faster
local pattern = regex.new('\\d+')

-- Slower (equivalent but not optimised)
local pattern = regex.new('[0-9]+')

Simplify complex classes:

-- Complex
local pattern = regex.new('[A-Za-z0-9_]+')

-- Simpler and equivalent
local pattern = regex.new('\\w+')

Anchoring Patterns

Anchor patterns to reduce search space:

-- Unanchored: searches entire string
local pattern = regex.new('\\d+')

-- Anchored: only checks from beginning
local pattern = regex.new('^\\d+')

-- Anchored both ends: exact match only
local pattern = regex.new('^\\d+$')
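
Anchors change the result as well as the search cost, which matters when a pattern is used for validation. The following sketch uses TypeScript's built-in RegExp as a stand-in for the Parasol API, with default flags (no multiline mode).

```typescript
// Stand-in check with the built-in ECMAScript RegExp.
const unanchoredDigits = new RegExp("\\d+");
const startAnchoredDigits = new RegExp("^\\d+");
const fullyAnchoredDigits = new RegExp("^\\d+$");

console.log(unanchoredDigits.test("abc123"));      // true: digits found anywhere
console.log(startAnchoredDigits.test("abc123"));   // false: must start with digits
console.log(startAnchoredDigits.test("123abc"));   // true
console.log(fullyAnchoredDigits.test("123abc"));   // false: trailing text not allowed
console.log(fullyAnchoredDigits.test("123"));      // true: the entire string is digits
```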

Unicode Property Matching

Unicode properties are optimised internally, but broad categories are faster than specific scripts:

-- Faster: general category
local pattern = regex.new('\\p{L}+')  -- All letters

-- Slower: specific script
local pattern = regex.new('\\p{sc=Latin}+')  -- Characters in the Latin script

Best Practices Summary

  1. Compile patterns once, reuse many times
  2. Store patterns in variables
  3. Use non-greedy quantifiers when appropriate
  4. Anchor patterns when possible (^, $)
  5. Avoid nested quantifiers that can cause exponential backtracking
  6. Use predefined character classes (\d, \w, \s)
  7. Be specific in patterns to reduce backtracking
  8. Test performance with realistic data

Common Patterns and Examples

This section provides practical regex patterns for common use cases.

Email Validation

Basic email pattern:

local pattern = regex.new('[\\w._%+-]+@[\\w.-]+\\.[A-Za-z]{2,}')
-- Matches: user@example.com, first.last@sub.domain.co.uk

Explanation:

  • [\w._%+-]+ - Username: word characters, dots, underscores, percent, plus, hyphen
  • @ - Literal @ symbol
  • [\w.-]+ - Domain name: word characters, dots, hyphens
  • \. - Literal dot
  • [A-Za-z]{2,} - Top-level domain: 2 or more letters

More strict pattern:

local pattern = regex.new('^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$')
-- Anchored to match entire string

URL Matching

Basic URL pattern:

local pattern = regex.new('(https?)://([^/\\s]+)([^\\s]*)')
-- Captures: protocol, domain, path

local match = pattern.match('https://example.com/path?query=value')
-- match[1] = "https://example.com/path?query=value" (full match)
-- match[2] = "https" (protocol)
-- match[3] = "example.com" (domain)
-- match[4] = "/path?query=value" (path)

With named captures:

local pattern = regex.new('(?<protocol>https?)://(?<domain>[^/\\s]+)(?<path>[^\\s]*)')

local match = pattern.match('https://example.com/path')
-- Access by name: match.domain (language binding dependent)
-- Access by number: match[3]
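
The capture layout of the basic URL pattern can be verified in a standard ECMAScript engine. This sketch uses TypeScript's built-in RegExp as a stand-in, so indices are 0-based: the full match is at index 0 and the groups start at index 1, one lower than the Fluid indices shown above.

```typescript
// Stand-in check with the built-in ECMAScript RegExp (0-based match array).
const urlPattern = new RegExp("(https?)://([^/\\s]+)([^\\s]*)");
const urlMatch = urlPattern.exec("https://example.com/path?query=value");

if (urlMatch !== null) {
  console.log(urlMatch[0]);   // "https://example.com/path?query=value" (full match)
  console.log(urlMatch[1]);   // "https" (protocol)
  console.log(urlMatch[2]);   // "example.com" (domain, including any port)
  console.log(urlMatch[3]);   // "/path?query=value" (path and query)
}
```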

Phone Numbers

US phone number:

-- Format: 555-123-4567
local pattern = regex.new('\\d{3}-\\d{3}-\\d{4}')

-- With optional country code: +1-555-123-4567
local pattern = regex.new('(\\+1-)?\\d{3}-\\d{3}-\\d{4}')

-- With optional separators (-, ., space, or none)
local pattern = regex.new('\\d{3}[-. ]?\\d{3}[-. ]?\\d{4}')

International E.164 format:

-- '+' followed by up to 15 digits, the first of which is non-zero (e.g. +1234567890)
local pattern = regex.new('\\+[1-9]\\d{1,14}')

Date Matching

ISO 8601 date (YYYY-MM-DD):

local pattern = regex.new('\\d{4}-\\d{2}-\\d{2}')
-- Matches: 2025-10-14

-- With validation (basic):
local pattern = regex.new('\\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\\d|3[01])')
-- Validates month (01-12) and day (01-31), but not the number of days in each month
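
A pattern can check the shape of a date but not the calendar, so 2025-02-31 passes the validating pattern above. One way to close the gap is to follow the regex with a calendar check. The sketch below uses TypeScript's built-in RegExp and Date as a stand-in for the Parasol API. The anchors and the capture group around the year are additions to the pattern, and isRealIsoDate is a hypothetical helper name.

```typescript
// Stand-in sketch with the built-in ECMAScript RegExp and Date.
// Same alternatives as the validating pattern, anchored, with the year captured.
const isoDatePattern = new RegExp("^(\\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\\d|3[01])$");

function isRealIsoDate(text: string): boolean {
  const parts = isoDatePattern.exec(text);
  if (parts === null) return false;            // wrong shape
  const year = Number(parts[1]);
  const month = Number(parts[2]);
  const day = Number(parts[3]);
  // Build the date in UTC; an impossible day rolls over into the next month.
  const probe = new Date(0);
  probe.setUTCFullYear(year, month - 1, day);
  return probe.getUTCFullYear() === year
      && probe.getUTCMonth() === month - 1
      && probe.getUTCDate() === day;
}

console.log(isoDatePattern.test("2025-02-31"));   // true: the shape is valid
console.log(isRealIsoDate("2025-02-31"));         // false: February has no day 31
console.log(isRealIsoDate("2024-02-29"));         // true: 2024 is a leap year
console.log(isRealIsoDate("2025-13-01"));         // false: rejected by the pattern
```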

US date format (MM/DD/YYYY):

local pattern = regex.new('(0[1-9]|1[0-2])/(0[1-9]|[12]\\d|3[01])/\\d{4}')
-- Matches: 10/14/2025

Flexible date format:

local pattern = regex.new('\\d{1,2}[-/]\\d{1,2}[-/]\\d{2,4}')
-- Matches: 10/14/2025, 10-14-25, 1/5/2025

Time Matching

24-hour time (HH:MM):

local pattern = regex.new('([01]?\\d|2[0-3]):[0-5]\\d')
-- Matches: 09:30, 23:59, 8:05

-- With optional seconds:
local pattern = regex.new('([01]?\\d|2[0-3]):[0-5]\\d(:[0-5]\\d)?')
-- Matches: 09:30, 09:30:45

12-hour time with AM/PM:

local pattern = regex.new('(0?[1-9]|1[0-2]):[0-5]\\d\\s*([AaPp][Mm])')
-- Matches: 9:30 AM, 12:45 PM, 9:30AM

Password Validation

Minimum requirements (8+ chars, 1 uppercase, 1 lowercase, 1 digit):

local pattern = regex.new('^(?=.*[a-z])(?=.*[A-Z])(?=.*\\d).{8,}$')

Explanation:

  • ^ - Start of string
  • (?=.*[a-z]) - Lookahead: at least one lowercase
  • (?=.*[A-Z]) - Lookahead: at least one uppercase
  • (?=.*\d) - Lookahead: at least one digit
  • .{8,} - At least 8 characters
  • $ - End of string

With special character requirement:

local pattern = regex.new('^(?=.*[a-z])(?=.*[A-Z])(?=.*\\d)(?=.*[@$!%*?&]).{8,}$')
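
Each lookahead is evaluated from the start of the string and consumes nothing, so the requirements are independent of one another. The following sketch exercises both password patterns with TypeScript's built-in RegExp as a stand-in for the Parasol API; the sample passwords are illustrative.

```typescript
// Stand-in check with the built-in ECMAScript RegExp.
const basicPolicy = new RegExp("^(?=.*[a-z])(?=.*[A-Z])(?=.*\\d).{8,}$");
const strictPolicy = new RegExp("^(?=.*[a-z])(?=.*[A-Z])(?=.*\\d)(?=.*[@$!%*?&]).{8,}$");

console.log(basicPolicy.test("Passw0rd"));     // true: upper, lower, digit, 8 characters
console.log(basicPolicy.test("password1"));    // false: no uppercase letter
console.log(basicPolicy.test("Pw0rd"));        // false: shorter than 8 characters
console.log(strictPolicy.test("Passw0rd"));    // false: no special character
console.log(strictPolicy.test("Passw0rd!"));   // true
```

Because . does not match line terminators by default, a password containing a line break fails both patterns.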

IP Address Matching

IPv4 address:

local pattern = regex.new('\\b(?:\\d{1,3}\\.){3}\\d{1,3}\\b')
-- Matches: 192.168.1.1, 10.0.0.1

-- With validation (0-255 per octet):
local pattern = regex.new('\\b(?:(?:25[0-5]|2[0-4]\\d|[01]?\\d\\d?)\\.){3}(?:25[0-5]|2[0-4]\\d|[01]?\\d\\d?)\\b')
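
The difference between the two IPv4 patterns only shows up on out-of-range octets. The sketch below compares them using TypeScript's built-in RegExp as a stand-in for the Parasol API; the test addresses are illustrative.

```typescript
// Stand-in check with the built-in ECMAScript RegExp.
const looseIPv4 = new RegExp("\\b(?:\\d{1,3}\\.){3}\\d{1,3}\\b");
const strictIPv4 = new RegExp(
  "\\b(?:(?:25[0-5]|2[0-4]\\d|[01]?\\d\\d?)\\.){3}(?:25[0-5]|2[0-4]\\d|[01]?\\d\\d?)\\b"
);

console.log(looseIPv4.test("192.168.1.1"));    // true
console.log(strictIPv4.test("192.168.1.1"));   // true
console.log(looseIPv4.test("999.1.1.1"));      // true: any 1-3 digits are accepted
console.log(strictIPv4.test("999.1.1.1"));     // false: 999 is out of range
console.log(strictIPv4.test("10.0.0.256"));    // false: 256 is out of range
```

For validating a whole field rather than searching within text, anchoring the strict pattern with ^ and $ instead of \b avoids accepting an address embedded in a longer string.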

IPv6 address (simplified):

local pattern = regex.new('(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}')
-- Matches full IPv6: 2001:0db8:85a3:0000:0000:8a2e:0370:7334

HTML/XML Tag Matching

Match opening and closing tags:

local pattern = regex.new('<([a-zA-Z][a-zA-Z0-9]*)\\b[^>]*>(.*?)</\\1>')
-- Matches: <div>content</div>, <span class="x">text</span>
-- Captures: tag name (group 1), content (group 2)

Extract tag content:

local pattern = regex.new('<[^>]+>(.*?)</[^>]+>')
-- Captures content between any tags

Match self-closing tags:

local pattern = regex.new('<[a-zA-Z][a-zA-Z0-9]*\\b[^>]*/>')
-- Matches: <br/>, <img src="x" />

CSV Parsing

Basic CSV field:

local pattern = regex.new('([^,]+),?')
-- Matches non-empty fields separated by commas (empty fields are skipped)

CSV with quoted fields:

local pattern = regex.new('(?:^|,)(?:\"([^\"]*(?:\"\"[^\"]*)*)\"|([^,]*))')
-- Handles: "quoted field", unquoted, "field with ""quotes"""
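
Turning the quoted-field pattern into a line parser takes a little care, because a leading empty field produces a zero-length match at the start of the line. The sketch below avoids that with a variant of the pattern: a comma is prepended to the input and the (?:^|,) prefix is replaced by a plain comma, so every match consumes at least one character. It uses TypeScript's built-in RegExp as a stand-in for the Parasol API, and parseCsvLine is a hypothetical helper, not part of any library.

```typescript
// Stand-in sketch with the built-in ECMAScript RegExp.
// Group 1 holds a quoted field (with "" still doubled), group 2 an unquoted field.
function parseCsvLine(line: string): string[] {
  const fieldPattern = new RegExp(',(?:"([^"]*(?:""[^"]*)*)"|([^,]*))', "g");
  const input = "," + line;          // every field is now preceded by a comma
  const fields: string[] = [];
  let hit: RegExpExecArray | null = fieldPattern.exec(input);
  while (hit !== null) {
    if (hit[1] !== undefined) {
      // Quoted field: collapse doubled quotes into single quotes.
      fields.push(hit[1].split('""').join('"'));
    } else {
      fields.push(hit[2]);
    }
    hit = fieldPattern.exec(input);
  }
  return fields;
}

console.log(parseCsvLine('name,"Smith, J.","say ""hi""",'));
// ["name", "Smith, J.", "say \"hi\"", ""]
console.log(parseCsvLine(",x"));
// ["", "x"]
```

This handles a single line only. Quoted fields that span line breaks, or text that follows a closing quote, need a dedicated CSV parser.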

Word Extraction

Extract words:

local pattern = regex.new('\\b\\w+\\b')
-- Matches: any word (alphanumeric + underscore)

local pattern = regex.new('\\b[A-Za-z]+\\b')
-- Matches: only alphabetic words

Extract words with apostrophes:

local pattern = regex.new('\\b[A-Za-z]+(?:\'[A-Za-z]+)?\\b')
-- Matches: don't, it's, can't, etc.

Number Extraction

Integer:

local pattern = regex.new('-?\\d+')
-- Matches: 123, -456

Floating point:

local pattern = regex.new('-?\\d+\\.\\d+')
-- Matches: 123.45, -67.89

-- With optional decimal part:
local pattern = regex.new('-?\\d+(?:\\.\\d+)?')
-- Matches: 123, 123.45, -67.89

Scientific notation:

local pattern = regex.new('-?\\d+(?:\\.\\d+)?(?:[eE][+-]?\\d+)?')
-- Matches: 1.23e10, -4.5E-6, 123
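
Applied globally, the scientific-notation pattern extracts every number from a piece of text. The sketch below shows this with TypeScript's built-in RegExp as a stand-in for the Parasol API; the sample sentence is illustrative.

```typescript
// Stand-in check with the built-in ECMAScript RegExp.
const numberPattern = new RegExp("-?\\d+(?:\\.\\d+)?(?:[eE][+-]?\\d+)?", "g");
const reading = "Values: 123, -4.5E-6 and 1.23e10";

// All matches as text, in order of appearance.
const numberTokens: string[] = reading.match(numberPattern) || [];
console.log(numberTokens);    // ["123", "-4.5E-6", "1.23e10"]

// Converted to numeric values.
const numberValues: number[] = numberTokens.map(function (token) {
  return Number(token);
});
console.log(numberValues[0]); // 123
console.log(numberValues[2]); // 12300000000
```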

Whitespace Handling

Trim leading/trailing whitespace:

local pattern = regex.new('^\\s+|\\s+$')
-- Use with replace to remove leading/trailing spaces

Collapse multiple spaces:

local pattern = regex.new('\\s+')
-- Replace each match with a single space to normalise whitespace

Split on whitespace:

local pattern = regex.new('\\s+')
-- Use with split to separate words
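
The three whitespace patterns are typically combined. The following sketch chains them using the replace and split methods of TypeScript's built-in strings, as a stand-in for the Parasol API (whose replace and split signatures are not shown in this manual). The trim pattern needs global replacement so that both alternatives, leading and trailing, are applied.

```typescript
// Stand-in check with the built-in ECMAScript RegExp.
const messy = "  alpha \t beta\n gamma  ";

// Trim: the "g" flag lets both the leading and the trailing run be removed.
const trimmed = messy.replace(new RegExp("^\\s+|\\s+$", "g"), "");
console.log(JSON.stringify(trimmed));     // "alpha \t beta\n gamma"

// Collapse: every run of whitespace becomes one space.
const collapsed = trimmed.replace(new RegExp("\\s+", "g"), " ");
console.log(collapsed);                   // alpha beta gamma

// Split: trimming first avoids empty strings at either end of the result.
const words = trimmed.split(new RegExp("\\s+"));
console.log(words);                       // ["alpha", "beta", "gamma"]
```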

File Path Matching

Unix/Linux path:

local pattern = regex.new('^(/[^/]+)+/?$')
-- Matches: /home/user/file.txt, /usr/local/bin/

Windows path:

local pattern = regex.new('^[A-Za-z]:\\\\(?:[^\\\\/:*?\"<>|]+\\\\)*[^\\\\/:*?\"<>|]*$')
-- Matches: C:\Users\Name\file.txt

File extension:

local pattern = regex.new('\\.([A-Za-z0-9]+)$')
-- Captures the file extension without the dot: txt, pdf, jpg

Version Number Matching

Semantic versioning:

local pattern = regex.new('^(0|[1-9]\\d*)\\.(0|[1-9]\\d*)\\.(0|[1-9]\\d*)(?:-((?:0|[1-9]\\d*|\\d*[a-zA-Z-][0-9a-zA-Z-]*)(?:\\.(?:0|[1-9]\\d*|\\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(?:\\+([0-9a-zA-Z-]+(?:\\.[0-9a-zA-Z-]+)*))?$')
-- Matches: 1.0.0, 2.1.3, 1.0.0-alpha.1, 1.0.0+build.123

Simple version:

local pattern = regex.new('\\d+\\.\\d+(?:\\.\\d+)?')
-- Matches: 1.0, 1.0.5, 2.10.1

Error Handling

When pattern compilation or matching fails, specific error types indicate the nature of the problem. Understanding these errors helps diagnose and fix pattern issues.

Compilation Errors

These errors occur when compiling a regex pattern:

Error | Description | Example
error_escape | Invalid escape sequence | \q (undefined escape), \c (not followed by letter), \x (not followed by two hex digits), \u{GGGG} (invalid hex)
error_brack | Mismatched square brackets | [abc (unclosed), abc] (unopened), [a[b] (unclosed nested class)
error_paren | Mismatched parentheses | (abc (unclosed), abc) (unopened), ((a) (unclosed)
error_brace | Mismatched curly braces | a{3 (unclosed), a3} (unopened), a{2, (missing closing brace)
error_badbrace | Invalid quantifier range | {3,2} (n > m), {-1} (negative), {,5} (missing n)
error_range | Invalid character range in class | [z-a] (reversed), [\u0100-\u0010] (start > end)
error_backref | Invalid backreference | \9 (group doesn't exist), \k<name> (name doesn't exist)
error_modifier | Invalid flag modifier | (?ii:...) (duplicate flag), (?i-i:...) (contradictory)
error_operator | Invalid set operator usage | [AB--CD] (mixed operators at same level), !! (reserved double punctuator in class)
error_noescape | Character must be escaped in a character class | [(] (should be [\(]), [{] (should be [\{])
error_complement | Invalid negation | [^\p{RGI_Emoji}] (string property in negated class), \P{RGI_Emoji} (string property with \P)
error_badrepeat | Quantifier without preceding expression | *abc (starts with quantifier), a** (double quantifier)
error_utf8 | Invalid UTF-8 sequence in pattern | Pattern contains invalid UTF-8 bytes, overlong encoding, or code point > U+10FFFF

Error Examples

error_escape

-- Invalid: \q is not defined
local pattern = regex.new('\\q')  -- Error: invalid escape sequence

-- Invalid: \c not followed by letter
local pattern = regex.new('\\c5')  -- Error: expected A-Z or a-z after \c

-- Invalid: \x not followed by two hex digits
local pattern = regex.new('\\xGG')  -- Error: expected two hex digits

-- Invalid: code point exceeds maximum
local pattern = regex.new('\\u{110000}')  -- Error: code point > U+10FFFF

-- Valid alternatives:
local pattern = regex.new('q')           -- Literal q
local pattern = regex.new('\\x71')       -- Hex escape for q
local pattern = regex.new('\\u0071')     -- Unicode escape for q

error_brack

-- Invalid: unclosed bracket
local pattern = regex.new('[abc')  -- Error: missing ]

-- Invalid: extra closing bracket
local pattern = regex.new('abc]')  -- Error: unmatched ]

-- Valid:
local pattern = regex.new('[abc]')       -- Correct bracket pair
local pattern = regex.new('\\]')         -- Escaped bracket (literal)

error_paren

-- Invalid: unclosed parenthesis
local pattern = regex.new('(abc')  -- Error: missing )

-- Invalid: extra closing parenthesis
local pattern = regex.new('abc)')  -- Error: unmatched )

-- Valid:
local pattern = regex.new('(abc)')       -- Correct parenthesis pair
local pattern = regex.new('\\(abc\\)')   -- Escaped parentheses (literals)

error_brace

-- Invalid: unclosed brace
local pattern = regex.new('a{3')  -- Error: missing }

-- Valid:
local pattern = regex.new('a{3}')        -- Correct quantifier
local pattern = regex.new('\\{3\\}')     -- Escaped braces (literals)

error_badbrace

-- Invalid: n > m in range
local pattern = regex.new('a{5,3}')  -- Error: 5 > 3

-- Invalid: missing n
local pattern = regex.new('a{,5}')  -- Error: must specify n

-- Valid:
local pattern = regex.new('a{3,5}')      -- n ≤ m
local pattern = regex.new('a{3,}')       -- n or more (no maximum)
local pattern = regex.new('a{3}')        -- exactly n

error_range

-- Invalid: reversed range
local pattern = regex.new('[z-a]')  -- Error: z (U+007A) > a (U+0061)

-- Invalid: reversed range written with escapes
local pattern = regex.new('[\\u0100-\\u0010]')  -- Error: start > end

-- Valid:
local pattern = regex.new('[a-z]')       -- Correct range
local pattern = regex.new('[z]')         -- Single character (no range)

error_backref

-- Invalid: group doesn't exist
local pattern = regex.new('\\5')  -- Error: no group 5

-- Invalid: named group doesn't exist
local pattern = regex.new('\\k<missing>')  -- Error: no group named 'missing'

-- Valid:
local pattern = regex.new('(a)\\1')            -- Backreference to group 1
local pattern = regex.new('(?<x>a)\\k<x>')    -- Named backreference

error_modifier

-- Invalid: duplicate flag
local pattern = regex.new('(?ii:abc)')  -- Error: 'i' appears twice

-- Invalid: contradictory flags
local pattern = regex.new('(?i-i:abc)')  -- Error: both +i and -i

-- Valid:
local pattern = regex.new('(?i:abc)')          -- Single flag
local pattern = regex.new('(?im:abc)')         -- Multiple different flags
local pattern = regex.new('(?i-m:abc)')        -- Enable and disable flags

error_operator

-- Invalid: mixed operators at same level
local pattern = regex.new('[AB--CD]')  -- Error: union (AB) then subtraction

-- Invalid: reserved double punctuator
local pattern = regex.new('[a-z!!]')  -- Error: !! is reserved

-- Valid:
local pattern = regex.new('[[AB]--[CD]]')      -- Nested classes
local pattern = regex.new('[A[B--C]D]')        -- Operator in nested level
local pattern = regex.new('[a-z\\!\\!]')       -- Escaped (two separate !)

error_noescape

-- Invalid: ( must be escaped in character class
local pattern = regex.new('[(]')  -- Error: must escape (

-- Invalid: { must be escaped
local pattern = regex.new('[{]')  -- Error: must escape {

-- Valid:
local pattern = regex.new('[\\(]')             -- Escaped (
local pattern = regex.new('[\\{\\}]')          -- Escaped braces

error_complement

-- Invalid: string property in negated class
local pattern = regex.new('[^\\p{RGI_Emoji}]')  -- Error: cannot negate string property

-- Invalid: string property with \P
local pattern = regex.new('\\P{RGI_Emoji}')     -- Error: \P doesn't support string properties

-- Valid:
local pattern = regex.new('[\\p{RGI_Emoji}]')      -- String property in positive class
local pattern = regex.new('\\P{Emoji}')             -- Character property (not string)
local pattern = regex.new('[^\\p{Emoji}]')          -- Character property negated

error_badrepeat

-- Invalid: quantifier at start
local pattern = regex.new('*abc')  -- Error: nothing to repeat

-- Invalid: double quantifier
local pattern = regex.new('a**')   -- Error: quantifier on quantifier

-- Valid:
local pattern = regex.new('a*bc')              -- Quantifier after character
local pattern = regex.new('\\*abc')            -- Escaped * (literal)

error_utf8

-- Invalid UTF-8 in pattern
-- (This typically occurs when pattern strings contain invalid byte sequences)

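-- Invalid: overlong encoding. The raw bytes C0 B0 must be present in the pattern string itself;
-- writing '\\xC0\\xB0' would instead pass two valid regex hex escapes (U+00C0, U+00B0).
local pattern = regex.new(string.char(0xC0, 0xB0))  -- Error: overlong form of U+0030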

-- Valid:
local pattern = regex.new('\\x30')             -- Shortest form
local pattern = regex.new('\\u0030')           -- Unicode escape

Handling Errors in Code

Fluid:

-- Using catch for error handling
local err, pattern = catch(function()
   return regex.new('[invalid')
end)

if err then
   print('Pattern compilation failed: ' .. err.message)
   print('Error line: ' .. (err.line or 'unknown'))
else
   -- Use pattern
end

C++:

try {
   auto pattern = pf::regex("[invalid");
} catch (const std::exception& e) {
   std::cerr << "Pattern compilation failed: " << e.what() << std::endl;
}
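
The same defensive approach applies in any ECMAScript-style engine: compile inside a guarded call and report the failure instead of letting it propagate. The sketch below uses TypeScript's built-in RegExp as a stand-in. It raises a generic SyntaxError rather than the Parasol error types listed above, and tryCompile and CompileResult are hypothetical names.

```typescript
// Stand-in sketch with the built-in ECMAScript RegExp.
interface CompileResult {
  pattern: RegExp | null;   // the compiled pattern, or null on failure
  error: string | null;     // the engine's message, or null on success
}

// Compiles in Unicode mode by default so that invalid escapes are rejected.
function tryCompile(source: string, flags: string = "u"): CompileResult {
  try {
    return { pattern: new RegExp(source, flags), error: null };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { pattern: null, error: message };
  }
}

// Unclosed bracket, unclosed parenthesis, reversed quantifier range,
// reversed class range, undefined escape, and one valid pattern.
const candidates = ["[invalid", "(abc", "a{5,3}", "[z-a]", "\\q", "\\d+"];
candidates.forEach(function (source) {
  const outcome = tryCompile(source);
  console.log(source, outcome.error === null ? "ok" : "failed: " + outcome.error);
});
```

Returning a result object keeps the calling code free of try/catch blocks and makes it straightforward to collect every invalid pattern from a configuration file in one pass.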

Debugging Tips

  1. Test patterns incrementally: Build complex patterns step by step, testing each addition

  2. Use online regex testers: Many tools visualise patterns and highlight errors (ensure they support ECMAScript syntax)

  3. Check bracket matching: Count opening and closing brackets/parentheses/braces

  4. Validate escape sequences: Ensure all \ sequences are valid

  5. Review operator precedence: Verify set operations are properly nested

  6. Examine Unicode sequences: Confirm \u{...} values are valid code points

  7. Test with edge cases: Try empty strings, very long strings, and strings with special characters

Common Mistakes

Forgetting to escape special characters:

-- Wrong: . matches any character
local pattern = regex.new('file.txt')
-- Matches: "file.txt", "file?txt", "fileXtxt"

-- Correct: \. matches literal dot
local pattern = regex.new('file\\.txt')
-- Matches: "file.txt" only
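
When the literal text comes from user input or a file name, escaping by hand is error-prone. A small helper can insert the backslashes. This is a minimal sketch in TypeScript using the built-in RegExp as a stand-in; escapeLiteral is a hypothetical helper, and the manual does not say whether the Parasol API provides an equivalent.

```typescript
// Stand-in sketch with the built-in ECMAScript RegExp.
// Prefixes every regex metacharacter with a backslash so the text matches literally.
function escapeLiteral(text: string): string {
  // Class of metacharacters: . * + ? ^ $ { } ( ) | [ ] \
  const metacharacters = new RegExp("[.*+?^${}()|[\\]\\\\]", "g");
  return text.replace(metacharacters, "\\$&");   // $& inserts the matched character
}

const literalName = new RegExp("^" + escapeLiteral("file.txt") + "$");
console.log(escapeLiteral("file.txt"));      // file\.txt
console.log(literalName.test("file.txt"));   // true
console.log(literalName.test("fileXtxt"));   // false: the dot is now literal
```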

Incorrect bracket nesting:

-- Wrong: brackets don't nest this way
local pattern = regex.new('[[a-z]')  -- Error

-- Correct: close every nested class
local pattern = regex.new('[[a-m][n-z]]')  -- Union of two ranges

Quantifier on quantifier:

-- Wrong: double quantifier
local pattern = regex.new('a*+')  -- Error

-- Correct: quantify a group (valid syntax, although nested quantifiers can backtrack heavily)
local pattern = regex.new('(a*)+')

Summary

This manual has covered the complete regular expression syntax and features supported by the Parasol Framework:

  • Character matching including Unicode escapes and special characters
  • Character classes with ranges, predefined classes, and set operations
  • Quantifiers for controlling repetition (greedy and non-greedy)
  • Groups and backreferences for capturing and reusing matched text
  • Assertions for zero-width matching conditions
  • Unicode support with full UTF-8 and property matching
  • Flags for controlling compilation and matching behaviour
  • Performance considerations for efficient pattern usage
  • Common patterns for practical applications
  • Error handling for debugging pattern issues

For API documentation on the Regex class and its methods, please refer to the Regex module documentation in the Parasol API reference.


This manual documents the regex implementation as of 2025. For updates and the latest specification, refer to the ECMAScript Specification.