Tips and Tricks / Regular Expressions: coding, examples, testing resources
« on: February 09, 2023, 07:37:58 PM »
Table of contents
(1) Their use in MusicBee | (1a) Purposes | (1b) Virtual tag functions using RegExes | (1c) Overview of RegEx symbols |
(2) Characters | (2a) Literal characters and custom character classes | (2b) Pre-defined character classes (Unicode) | (2c) Literal characters, backslashed | (2d) Combining characters within a RegEx |
(3) Groups | (3a) Group types | (3b) Lookarounds | (3c) Substitutions (in the replacement string) | (3d) Backreferences (in the RegEx) |
(4) Other symbols | (4a) Quantifiers | (4b) Greedy and lazy quantifiers | (4c) Anchors | (4d) Case-sensitive matches |
(5) External resources | (5a) Exercises | (5b) Testing pages | (5c) Unicode character tables | (5d) Supported Unicode character classes |
(1) Their use in MusicBee
(1a) Purposes
A regular expression (RegEx) is a pattern that is compared against the content of a tag in order to perform:
- match tests
- substring extractions
- replacements
(1b) Virtual tag functions using RegExes
Function and syntax | Purpose | IF <tag> matches regex, returns: | ELSE, returns: |
$IsMatch(<tag>,regex) | checks if the regex is present anywhere in the tag | T | F |
$RxReplace(<tag>,regex,new) | global string replacement | <tag> content with each occurrence of regex replaced by the new replacement string | the unmodified <tag> content |
$RxMatch(<tag>,regex) | first matching portion | first portion of the tag matching the regex | an empty string |
$RxSplit(<tag>,regex,n) | nth delimited section | nth section of the <tag> content split up by using regex as delimiter | the unmodified <tag> content |
Notes for the functions presented above:
- use double quotes around the regex string or new replacement string if they include special characters such as comma, round bracket, angle bracket, single quote, i.e.: , ( ) < > '
- the match test is case-insensitive by default ("x" matches both x and X). See "Case-sensitive matches".
- the tag itself is not modified in the audio track nor in MusicBee's library.
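As a rough illustration of what the four functions compute, the sketch below mimics them with Python's re module. The helper names (is_match, rx_replace, rx_match, rx_split) are hypothetical, Python only stands in for MusicBee's own engine, and two points are assumptions rather than documented behaviour: that n in $RxSplit counts from 1, and that an out-of-range n returns the unmodified content. Note also that Python writes group references in the replacement string as \1 rather than $1.

```python
import re

FLAGS = re.IGNORECASE  # mimic MusicBee's case-insensitive default

def is_match(tag, regex):
    """Like $IsMatch: 'T' if regex is found anywhere in tag, else 'F'."""
    return "T" if re.search(regex, tag, FLAGS) else "F"

def rx_replace(tag, regex, new):
    """Like $RxReplace: replace every occurrence of regex with new."""
    return re.sub(regex, new, tag, flags=FLAGS)

def rx_match(tag, regex):
    """Like $RxMatch: first matching portion, or an empty string."""
    m = re.search(regex, tag, FLAGS)
    return m.group(0) if m else ""

def rx_split(tag, regex, n):
    """Like $RxSplit: nth section (ASSUMED 1-based) using regex as delimiter."""
    parts = re.split(regex, tag, flags=FLAGS)
    # ASSUMPTION: an out-of-range n falls back to the unmodified content.
    return parts[n - 1] if 1 <= n <= len(parts) else tag

composers = "Bach; Mozart; Haydn"
print(is_match(composers, "mozart"))      # T
print(rx_replace(composers, ";", " /"))   # Bach / Mozart / Haydn
print(rx_match(composers, r"\w+"))        # Bach
print(rx_split(composers, r";\s*", 2))    # Mozart
```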
(1c) Overview of RegEx symbols
RegEx symbols include: constituents, quantifiers, anchors and grouping marks.
- Each RegEx constituent can be
  - a single character: literal, or belonging to a set/range, or
  - a group: several characters between rounded brackets (parentheses).
- Any constituent is optionally followed by a quantifier, indicating how many times it must be present.
- Anchors can be added to the RegEx to specify positions or boundaries.
- You can specify whether the match must be case-sensitive, and for which RegEx portion.
(2) Characters
(2a) Literal characters and custom character classes
Symbol | Meaning |
a | literal character 'a'. All characters match themselves literally, except those having special meaning in a RegEx: . + * ? ^ $ ( ) [ ] { } | \ (see also "Literal characters, backslashed"). |
[aeiou] | any character in the set |
[^aeiou] | any character not in the set |
[a-z] | any character in the range |
[^a-z] | any character not in the range |
. | any character, except new line |
[.] | character '.' |
\uFFFF | Unicode (UTF-16) character with hexadecimal code 'FFFF' |
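The rows of this table can be tried out with a small sketch. It uses Python's re module as a stand-in for MusicBee's engine; unlike MusicBee, Python is case-sensitive unless told otherwise, which is why 'M' and 'B' are not treated as lowercase letters here. The sample strings are made up for illustration.

```python
import re

word = "MusicBee"

in_set = re.findall(r"[aeiou]", word)       # any character in the set
not_in_set = re.findall(r"[^aeiou]", word)  # any character not in the set
in_range = re.findall(r"[a-m]", word)       # any character in the range a..m
any_char = re.findall(r"B.e", word)         # '.' stands for any one character
literal_dot = re.findall(r"[.]", "feat. A.B.")   # '.' inside brackets is literal
by_code = re.findall(r"\u00e9", "Beyonc\u00e9")  # character given by its hex code

print(in_set)       # ['u', 'i', 'e', 'e']
print(not_in_set)   # ['M', 's', 'c', 'B']
print(in_range)     # ['i', 'c', 'e', 'e']
print(any_char)     # ['Bee']
```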
(2b) Pre-defined character classes (Unicode)
Backslash + lowercase letter = positive class.
Backslash + UPPERcase letter = negative class (all but the positive class).
Symbol | Meaning |
\w | a word character: letter, digit, connector punctuation (such as underscore; the hyphen is not a word character) |
\W | a non-word character |
\d | a digit |
\D | a non-digit character |
\s | a white space character (including: space, tab, vertical tab, linefeed) |
\S | a non-space character |
\p{P} | any punctuation character, such as: . , ' ‘ " “ - - : ; |
\P{P} | any non-punctuation character |
\p{S} | any symbol character (currency, math, ...) |
\P{S} | any non-symbol character |
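A short sketch of the shorthand classes, again using Python's re module as a stand-in for MusicBee's engine. It covers only \w, \D and \s: Python's standard re module does not support the \p{...} classes, so those rows are not demonstrated. The sample strings are invented.

```python
import re

# \w+ stops at the hyphen but runs through the underscore.
words = re.findall(r"\w+", "hi-fi_mix 2023")

# \D matches every non-digit, so removing it keeps only the digits.
digits_only = re.sub(r"\D", "", "2023-02-09")

# \s+ matches any run of spaces, tabs or line breaks.
fields = re.split(r"\s+", "Bach \t Mozart\nHaydn")

print(words)        # ['hi', 'fi_mix', '2023']
print(digits_only)  # 20230209
print(fields)       # ['Bach', 'Mozart', 'Haydn']
```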
(2c) Literal characters, backslashed
To match a character that has special meaning in regular expressions, precede it by a backslash.
Symbol | Meaning |
\. \[ \] | The character after the backslash: . [ ] Meaning without \: see "Custom character classes". |
\+ \* \? \{ \} | The character after the backslash: + * ? { } Meaning without \: see "Quantifiers". |
\( \) \| | The character after the backslash: ( ) | Meaning without \: see "Groups". |
\^ \$ | The character after the backslash: ^ $ Meaning without \: see "Anchors". |
\\ | a backslash |
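The effect of the backslash can be seen by comparing an escaped and an unescaped dot. This is a minimal sketch with Python's re module standing in for MusicBee's engine; the sample strings are invented.

```python
import re

text = "A.B AxB"

unescaped = re.findall(r"A.B", text)    # '.' means "any character"
escaped = re.findall(r"A\.B", text)     # '\.' means a literal dot
bracketed = re.findall(r"A[.]B", text)  # same effect with a character class

# A literal backslash is written as two backslashes in the RegEx.
backslashes = re.findall(r"\\", r"C:\Music")

print(unescaped)  # ['A.B', 'AxB']
print(escaped)    # ['A.B']
```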
(2d) Combining characters within a RegEx
You can place single characters side-by-side as you wish.
Symbol | Meaning |
foobar | literal string 'foobar' |
[a-z][0-9] | any letter, immediately followed by any digit => matches a3, b0, c9, … |
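Both rows can be checked with a short sketch (Python's re module as a stand-in for MusicBee's engine; Python is case-sensitive by default, so [a-z] means lowercase letters only here):

```python
import re

# Two classes side by side: a lowercase letter immediately followed by a digit.
pairs = re.findall(r"[a-z][0-9]", "a3 b0 zz 9c c9")

# Literal characters side by side form a literal string.
has_foobar = re.search(r"foobar", "my foobar tag") is not None

print(pairs)       # ['a3', 'b0', 'c9']
print(has_foobar)  # True
```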
(3) Groups
A group is a series of contiguous single characters, defined by enclosing the string in rounded brackets (parentheses).
A group serves to:
- store the result of the matching group in an indexed memory (1, 2, 3, …)
- store the result of the matching group in a named memory
- define the scope of the next quantifier (see "Quantifiers" below)
- specify alternatives (option groups)
For the use of the stored memories, see "Substitutions" below.
(3a) Group types
Symbol | Group type | Meaning |
(SubRegEx) | indexed group, capturing | Treats SubRegEx as a group and places the matching string in an indexed memory. Each group of the whole RegEx is assigned an incremented number, starting from 1. See "Substitutions" below. |
(?:SubRegEx) | non-capturing group | Treats SubRegEx as a group, without placing the matching string in an indexed memory. Useful if you simply want to apply a quantifier to the group. |
(?<alias>SubRegEx) | named group, capturing | The matching string is stored in a named memory instead of an indexed one. See "Substitutions" below. |
(choice1|choice2|...) | option group, capturing | Matches any of the choices, and stores the matching string in an indexed memory. Example: • (Bach|Mozart) matches 'Bach' and 'Mozart' |
(?:choice1|choice2|...) | option group, non-capturing | Matches any of the choices, without storing the matching string in an indexed memory. |
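The group types above can be compared in a small sketch. Python's re module is used as a stand-in for MusicBee's engine, with one syntax difference to keep in mind: Python writes a named group as (?P<alias>...) instead of (?<alias>...). The sample strings and the alias 'composer' are invented.

```python
import re

# Indexed capturing groups: numbered 1, 2, ... from left to right.
m = re.search(r"(\d{4})-(\d{2})", "Released 2023-02")
indexed = m.groups()

# Non-capturing group: '+' applies to 'ab' as a whole, but only (c) is stored.
m2 = re.search(r"(?:ab)+(c)", "ababc")
whole, stored = m2.group(0), m2.groups()

# Named option group (Python spelling of a named capturing group).
m3 = re.search(r"(?P<composer>Bach|Mozart)", "Sonata by Mozart")
named = m3.group("composer")

print(indexed)        # ('2023', '02')
print(whole, stored)  # ababc ('c',)
print(named)          # Mozart
```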
(3b) Lookarounds
'Lookarounds' are special groups used to define the context: what immediately precedes or follows.
Symbol | Lookaround type | Meaning |
a(?=suffix) | positive lookahead | Matches 'a' followed by 'suffix', without including 'suffix' itself in the matching string. |
a(?!suffix) | negative lookahead | Matches 'a' not followed by 'suffix', without including 'suffix' itself in the matching string. |
(?<=prefix)a | positive lookbehind | Matches 'a' preceded by 'prefix', without including 'prefix' itself in the matching string. |
(?<!prefix)a | negative lookbehind | Matches 'a' not preceded by 'prefix', without including 'prefix' itself in the matching string. |
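The four lookarounds can be told apart by replacing only the 'a' that satisfies each condition. A minimal sketch, with Python's re module standing in for MusicBee's engine and deliberately tiny sample strings:

```python
import re

# In each case only the matched 'a' is replaced; the context is left untouched.
pos_ahead = re.sub(r"a(?=b)", "A", "ab ac")    # 'a' followed by 'b'
neg_ahead = re.sub(r"a(?!b)", "A", "ab ac")    # 'a' not followed by 'b'
pos_behind = re.sub(r"(?<=b)a", "A", "ba ca")  # 'a' preceded by 'b'
neg_behind = re.sub(r"(?<!b)a", "A", "ba ca")  # 'a' not preceded by 'b'

print(pos_ahead)   # Ab ac
print(neg_ahead)   # ab Ac
print(pos_behind)  # bA ca
print(neg_behind)  # ba cA
```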
(3c) Substitutions (in the replacement string)
A substitution is the use of an indexed or named memory (capturing group) in the replacement string of $RxReplace.
In the examples below, <MyTag> contains 'FooBar'.
Symbol | Meaning | Example |
$n | indexed group n | $RxReplace(<MyTag>,"(.{3})(.{3})","$2,$1") -> 'Bar,Foo' |
$` | the substring before the match | $RxReplace(<MyTag>,"a","$`") -> 'FooBFooBr' |
$' | the substring after the match | $RxReplace(<MyTag>,"a","$'") -> 'FooBrr' |
$+ | the last indexed memory | $RxReplace(<MyTag>,"^(.{3})(.{3})(?:.*)$","$+") -> 'Bar' |
${alias} | the content of stored memory alias See: (?<alias>SubRegEx) in "Groups" | $RxReplace(<MyTag>,"(?<First>.{3})(?<Second>.{3})","${Second}") -> 'Bar' |
$& | the matching string | $RxReplace(<MyTag>,"[aeiou]{2}","[$&]") -> 'F[oo]Bar' |
$_ | the whole input string | $RxReplace(<MyTag>,"[aeiou]{2}","[$_]") -> 'F[FooBar]Bar' |
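For readers who want to check these results outside MusicBee, the sketch below reproduces several rows with Python's re module. Python is only a stand-in and spells substitutions differently: \2 or \g<2> instead of $2, \g<alias> instead of ${alias}, \g<0> instead of $&, and (?P<alias>...) for a named group. Python has no tokens for $` and $', so they are emulated here with a replacement function.

```python
import re

tag = "FooBar"

swapped = re.sub(r"(.{3})(.{3})", r"\2,\1", tag)        # $2,$1
bracketed = re.sub(r"[aeiou]{2}", r"[\g<0>]", tag)      # [$&]
by_name = re.sub(r"(?P<First>.{3})(?P<Second>.{3})",
                 r"\g<Second>", tag)                    # ${Second}

# Emulation of $` (text before the match) and $' (text after the match).
before = re.sub(r"a", lambda m: m.string[:m.start()], tag)
after = re.sub(r"a", lambda m: m.string[m.end():], tag)

print(swapped)    # Bar,Foo
print(bracketed)  # F[oo]Bar
print(by_name)    # Bar
print(before)     # FooBFooBr
print(after)      # FooBrr
```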
(3d) Backreferences (in the RegEx)
A backreference is the use of an indexed or named memory (capturing group) within the same RegEx that defines the memory, rather than in the replacement string.
Symbol | Meaning | Example |
\n | indexed group n | "(.)\1" matches a character followed by its repetition, such as: aa, bb, cc, … "(.)(.)\2\1" matches two characters followed by their repetition in mirror, such as: ABBA, elle, otto, ... |
\k<alias> | named group alias | "(?<one>.)\k<one>" matches a character followed by its repetition, such as: aa, bb, cc, … "(?<one>.)(?<two>.)\k<two>\k<one>" matches two characters followed by their repetition in mirror, such as: ABBA, elle, otto, … |
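The backreference examples can be verified with a short sketch. Python's re module stands in for MusicBee's engine; indexed backreferences are written the same way (\1, \2), but a named backreference is spelled (?P=alias) in Python instead of \k<alias>.

```python
import re

# A character followed by its repetition.
doubled = [m.group(0) for m in re.finditer(r"(.)\1", "bookkeeper")]

# Two characters followed by their repetition in mirror.
mirrored = [m.group(0) for m in re.finditer(r"(.)(.)\2\1", "ABBA elle otto")]

# Named backreference (Python spelling).
named_hit = re.fullmatch(r"(?P<one>.)(?P=one)", "aa") is not None
named_miss = re.fullmatch(r"(?P<one>.)(?P=one)", "ab") is not None

print(doubled)   # ['oo', 'kk', 'ee']
print(mirrored)  # ['ABBA', 'elle', 'otto']
```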
(4) Other symbols
(4a) Quantifiers
A quantifier is placed after a character or a group to indicate how many times it must be repeated (contiguously).
Symbol | Meaning | Example |
+ | one or more | ba+r matches: bar, baar, baaar, … but not: br |
? | zero or one | ba?r matches: br, bar but not: baar |
* | zero or more | ba*r matches: br, bar, baar, … |
{m,n} | between m and n times (both inclusive) | ba{2,3}r matches: baar, baaar |
{m,} | at least m times | ba{2,}r matches: baar, baaar, baaaar, … |
{0,n} | at most n times | ba{0,2}r matches: br, bar, baar |
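The table can be checked mechanically by testing each pattern against the same list of candidate words. A minimal sketch with Python's re module as a stand-in for MusicBee's engine; the helper name matching is hypothetical, and fullmatch is used so that the whole word must fit the pattern.

```python
import re

candidates = ["br", "bar", "baar", "baaar", "baaaar"]

def matching(pattern):
    """Return the candidate words that the pattern matches entirely."""
    return [w for w in candidates if re.fullmatch(pattern, w)]

one_or_more = matching(r"ba+r")
zero_or_one = matching(r"ba?r")
zero_or_more = matching(r"ba*r")
two_to_three = matching(r"ba{2,3}r")
at_least_two = matching(r"ba{2,}r")
at_most_two = matching(r"ba{0,2}r")

print(one_or_more)   # ['bar', 'baar', 'baaar', 'baaaar']
print(zero_or_one)   # ['br', 'bar']
print(two_to_three)  # ['baar', 'baaar']
```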
The examples above match repeated single characters.
Here is one for a repeated group:
"(the ){2}" matches a string containing "the the ".
(4b) Greedy and lazy quantifiers
By default, a constituent + quantifier matches the longest possible substring (= greedy behaviour). To make it match the shortest possible substring instead (= lazy behaviour), add ? after the quantifier.
Greedy | Lazy | Meaning |
+ | +? | one or more |
? | ?? | zero or one |
* | *? | zero or more |
{m,n} | {m,n}? | between m and n times (both inclusive) |
{m,} | {m,}? | at least m times |
{0,n} | {0,n}? | at most n times |
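The difference shows up as soon as a string contains more than one possible end point. A minimal sketch on an invented title with two bracketed parts, using Python's re module as a stand-in for MusicBee's engine:

```python
import re

title = "(Live) Song (Remastered)"

# Greedy: '.+' runs as far as possible, up to the LAST closing bracket.
greedy = re.search(r"\(.+\)", title).group(0)

# Lazy: '.+?' stops as early as possible, at the FIRST closing bracket.
lazy = re.search(r"\(.+?\)", title).group(0)

print(greedy)  # (Live) Song (Remastered)
print(lazy)    # (Live)
```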
(4c) Anchors
Anchors are special markers: instead of matching actual characters, they specify the position relative to the string boundaries or word boundaries.
Symbol | Meaning | Example |
^ | at start of string | "^A" matches "A" at the beginning of a string |
$ | at end of string | "bee$" matches "bee" at the end of a string |
\b | on word boundary | "\bbee" matches words starting with bee, such as "beehive"; "bee\b" matches words ending with bee, such as "MusicBee" |
\B | not on word boundary | "\Bbee\B" matches "bee" in "bumblebees" |
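The anchor examples can be verified with a short sketch. Python's re module stands in for MusicBee's engine, and the IGNORECASE flag is passed to mimic MusicBee's case-insensitive default; the helper name found is hypothetical.

```python
import re

def found(pattern, text):
    """True if the pattern matches somewhere in text (case-insensitive)."""
    return re.search(pattern, text, re.IGNORECASE) is not None

start_hit = found(r"^A", "Abba")            # 'A' at the start
start_miss = found(r"^A", "Bach A")         # 'A' present, but not at the start
end_hit = found(r"bee$", "MusicBee")        # 'bee' at the end
word_start = found(r"\bbee", "beehive")     # word starting with 'bee'
not_word_start = found(r"\bbee", "MusicBee")  # 'Bee' is inside the word
word_end = found(r"bee\b", "MusicBee")      # word ending with 'bee'
inside = found(r"\Bbee\B", "bumblebees")    # 'bee' strictly inside a word
not_inside = found(r"\Bbee\B", "bee")       # 'bee' touches both boundaries

print(start_hit, start_miss)      # True False
print(word_start, not_word_start) # True False
print(inside, not_inside)         # True False
```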
(4d) Case-sensitive matches
All match checks are case-insensitive by default in MusicBee (unlike other applications).
- To make a RegEx part case-sensitive, prefix it with (?-i).
  Thus, placing (?-i) at the very beginning makes the whole RegEx case-sensitive.
- To switch back to case-insensitive, add (?i) to the RegEx before the concerned part.
Symbol | Meaning | Example |
(?-i) | case-sensitive: applies to the RegEx part to its right | "(?-i)a" matches "a" but not "A" |
(?i) | case-insensitive: applies to the RegEx part to its right | "(?-i)a(?i)a" matches "aA" but not "AA" |
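The same behaviour can be approximated outside MusicBee. In the sketch below Python's re module is a stand-in with two differences: it is case-sensitive unless the IGNORECASE flag is passed, and it uses the scoped form (?-i:...) around the affected part rather than a switch that applies to everything on its right. So "(?-i)a(?i)a" from the table is written here as "(?-i:a)a" under IGNORECASE.

```python
import re

I = re.IGNORECASE  # mimic MusicBee's case-insensitive default

default_hit = re.fullmatch(r"a", "A", I) is not None          # insensitive
sensitive_hit = re.fullmatch(r"(?-i:a)", "a", I) is not None   # exact case
sensitive_miss = re.fullmatch(r"(?-i:a)", "A", I) is not None  # wrong case

# First 'a' case-sensitive, second 'a' case-insensitive.
mixed_hit = re.fullmatch(r"(?-i:a)a", "aA", I) is not None
mixed_miss = re.fullmatch(r"(?-i:a)a", "AA", I) is not None

print(default_hit, sensitive_hit, sensitive_miss)  # True True False
print(mixed_hit, mixed_miss)                       # True False
```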
(5) External resources
(5a) Exercises
Auto-corrected exercises. You can see the result of your RegEx as you type it.
(5b) Testing pages
For building, testing, and debugging your RegExes (match tests, replacements).
Same purpose, simpler interfaces:
(5c) Unicode character tables
Reference tables in PDF format:
(5d) Supported Unicode character classes
Microsoft .NET reference: