Topic: Regular Expressions: coding, examples, testing resources

karbock

  • Sr. Member
  • ****
  • Posts: 320
A Review of Regular Expressions

Table of contents

(1) Their use in MusicBee
(2) Characters
(3) Groups
(4) Other symbols
(5) External resources


(1) Their use in MusicBee


(1a) Purposes

A regular expression (RegEx) is a pattern that is compared against the content of a tag for:
  • match tests
  • substring extractions
  • replacements


(1b) Virtual tag functions using RegExes

Function and syntax, purpose, and return values:
  • $IsMatch(<tag>,RegEx)
    - Purpose: checks if the RegEx is present anywhere in the tag
    - IF <tag> matches RegEx, returns: T
    - ELSE, returns: F
  • $RxReplace(<tag>,RegEx,new)
    - Purpose: global string replacement
    - IF <tag> matches RegEx, returns: <tag> content with each occurrence of RegEx replaced by the new replacement string
    - ELSE, returns: the unmodified <tag> content
  • $RxMatch(<tag>,RegEx)
    - Purpose: first matching portion
    - IF <tag> matches RegEx, returns: first portion of the tag matching the RegEx
    - ELSE, returns: an empty string
  • $RxSplit(<tag>,RegEx,n)
    - Purpose: nth delimited section
    - IF <tag> matches RegEx, returns: nth section of the <tag> content split up by using RegEx as delimiter
    - ELSE, returns: the unmodified <tag> content

Notes for the functions presented above:
  • use double quotes around the RegEx string or new replacement string if they include special characters such as comma, round bracket, angle bracket, single quote, i.e.:
    , ( ) < > '
  • the match test is case-insensitive by default ("x" matches both x and X)
    -> See "(4d) Case-sensitive matches";
  • the tag itself is not modified in the audio track nor in MusicBee's library.
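
The behaviour of the four functions can be approximated in Python for experimenting outside MusicBee. This is only a sketch: the helper names (is_match, rx_replace, rx_match, rx_split) and the sample tag are made up for this example, Python's re module is not the .NET engine MusicBee relies on, and both the 1-based counting of n and the empty result for an out-of-range n are assumptions.

```python
import re

FLAGS = re.IGNORECASE  # MusicBee matches case-insensitively by default


def is_match(tag: str, regex: str) -> str:
    """Sketch of $IsMatch: 'T' if the RegEx occurs anywhere in the tag, else 'F'."""
    return "T" if re.search(regex, tag, FLAGS) else "F"


def rx_replace(tag: str, regex: str, new: str) -> str:
    """Sketch of $RxReplace: replace every occurrence; unchanged if nothing matches."""
    return re.sub(regex, new, tag, flags=FLAGS)


def rx_match(tag: str, regex: str) -> str:
    """Sketch of $RxMatch: first matching portion, or an empty string."""
    m = re.search(regex, tag, FLAGS)
    return m.group(0) if m else ""


def rx_split(tag: str, regex: str, n: int) -> str:
    """Sketch of $RxSplit: nth section (assumed 1-based); unchanged tag if no match."""
    if not re.search(regex, tag, FLAGS):
        return tag
    parts = re.split(regex, tag, flags=FLAGS)
    # Assumption: an out-of-range n yields an empty string.
    return parts[n - 1] if 1 <= n <= len(parts) else ""


tag = "Intro; Verse; Outro"  # made-up tag content
print(is_match(tag, "verse"))           # T
print(rx_replace(tag, "; ", " / "))     # Intro / Verse / Outro
print(rx_match(tag, r"\w+"))            # Intro
print(rx_split(tag, "; ", 2))           # Verse
```

Note that Python writes group references in the replacement string as \1 rather than $1, so replacement strings are not interchangeable with MusicBee's.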


(1c) Overview of RegEx symbols

RegEx symbols include: constituents, quantifiers, anchors and grouping marks.
  • Each RegEx constituent can be
    - a single character: literal, or belonging to a set/range, or
    - a group: several characters between rounded brackets (parentheses).
  • Any constituent is optionally followed by a quantifier, indicating how many times it must be present.
  • Anchors can be added to the RegEx to specify positions or boundaries.
  • You can specify whether the match must be case-sensitive, and for which RegEx portion.



(2) Characters


(2a) Literal characters and custom character classes

Symbol      Meaning
a           literal character 'a'
            All characters match themselves literally,
            except those having special meaning in a RegEx:
            . + * ? ^ $ ( ) [ ] { } | \
            See also table "(2c) Literal characters, backslashed".
[aeiou]     any character in the set
[^aeiou]    any character not in the set
[a-z]       any character in the range
[^a-z]      any character not in the range
.           any character, except new line
[.]         character '.'
\uFFFF      Unicode (UTF-16) character with hex. code 'FFFF'
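
A quick way to try these symbols is a small Python check. It is a sketch only: the matches helper is made up for this example, and Python's re module is case-sensitive by default and is not the .NET engine used by MusicBee.

```python
import re


def matches(pattern: str, char: str) -> bool:
    """True if the pattern matches the whole one-character string."""
    return re.fullmatch(pattern, char) is not None


in_set = matches(r"[aeiou]", "e")          # True: 'e' is in the set
not_in_set = matches(r"[^aeiou]", "e")     # False: 'e' is excluded
in_range = matches(r"[a-z]", "q")          # True: 'q' is in the range
dot_any = matches(r".", "x")               # True: '.' matches any character
dot_newline = matches(r".", "\n")          # False: except new line
dot_in_class = matches(r"[.]", "x")        # False: '[.]' is a literal dot
by_code = matches(r"\u00E9", "\u00e9")     # True: 'é' given by its hex code
```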



(2b) Pre-defined character classes (Unicode)

Backslash + lowercase letter = positive class.
Backslash + UPPERCASE  letter = negative class (any character not in the positive class).

Symbol    Meaning
\w        a word character: letter, digit, connector punctuation (such as underscore; the hyphen is not a word character)
\W        a non-word character
\d        a digit
\D        a non-digit character
\s        a white space character (including: space, tab, vertical tab, linefeed)
\S        a non-space character
\p{P}     any punctuation character, such as:
          . , ' ‘ " “ - – : ;
\P{P}     any non-punctuation character
\p{S}     any symbol character (currency, math, ...)
\P{S}     any non-symbol character
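
The three letter-based classes can be tried in Python as sketched below. The sample strings are made up, and \p{P} and \p{S} are left out because Python's built-in re module does not support them (they are .NET features here). In Python at least, the underscore counts as a word character while the hyphen does not.

```python
import re

# \w+ : runs of word characters; the hyphen splits, the underscore does not
words = re.findall(r"\w+", "AC-DC_live 2023")

# \d+ : runs of digits
numbers = re.findall(r"\d+", "Track 07 of 12")

# \s+ : runs of white space, here collapsed to a single space
tidy = re.sub(r"\s+", " ", "too   many\tspaces")

print(words)    # ['AC', 'DC_live', '2023']
print(numbers)  # ['07', '12']
print(tidy)     # too many spaces
```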



(2c) Literal characters, backslashed

To match a character that has special meaning in regular expressions, precede it by a backslash.

Symbol                  Meaning
\.   \[   \]            The character after the backslash: . [ ]
                        Meaning without \: see "(2a) Custom character classes".
\+   \*   \?   \{   \}  The character after the backslash: + * ? { }
                        Meaning without \: see "(4a) Quantifiers".
\(   \)   \|            The character after the backslash: ( ) |
                        Meaning without \: see "(3) Groups".
\^   \$                 The character after the backslash: ^ $
                        Meaning without \: see "(4c) Anchors".
\\                      a backslash
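
To see why the backslash matters, here is a small Python sketch with a made-up title. It assumes Python's re module, which treats these escapes the same way for the characters shown, but it is not the .NET engine MusicBee uses.

```python
import re

title = "Song (feat. Someone) [Live]"  # made-up tag content

# Unescaped "." means "any character", so "feat." also matches "feats".
unescaped = re.search(r"feat.", "defeats")
# Escaped "\." only matches a real dot.
escaped = re.search(r"feat\.", "defeats")
in_title = re.search(r"feat\.", title)

# Escaped round brackets are literal characters, not a group.
bracketed = re.search(r"\(feat\. .+\)", title)

# re.escape adds the backslashes needed to match a string literally.
literal = re.escape("[Live]")

print(unescaped.group(0))   # feats
print(escaped)              # None
print(bracketed.group(0))   # (feat. Someone)
print(literal)              # \[Live\]
```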



(2d) Combining characters within a RegEx

You can place single characters side-by-side as you wish.

Symbol        Meaning
foobar        literal string 'foobar'
[a-z][0-9]    any letter, immediately followed by any digit => matches a3, b0, c9, …



(3) Groups

A group is a series of contiguous characters, defined by enclosing that part of the RegEx in round brackets (parentheses).
A group serves to:
  • store the result of the matching group in an indexed memory (1, 2, 3, …)
  • store the result of the matching group in a named memory
  • define the scope of the next quantifier (see "(4a) Quantifiers" below)
  • specify different options
The contents of the indexed/named memories can then be used in the replacement string (with $RxReplace).
See "(3c) Substitutions" below.



(3a) Group types

Symbol, group type and meaning:
  • (SubRegEx) : indexed group, capturing
    Treats SubRegEx as a group and places the matching string in an indexed memory.
    Each group of the whole RegEx is assigned an incremented number, starting from 1.
    See "(3c) Substitutions" below.
  • (?:SubRegEx) : non-capturing group
    Treats SubRegEx as a group, without placing the matching string in an indexed memory.
    Useful if you simply want to apply a quantifier to the group.
  • (?<alias>SubRegEx) : named group, capturing
    The matching string is stored in a named memory instead of an indexed one.
    See "(3c) Substitutions" below.
  • (choice1|choice2|...) : option group, capturing
    Matches any of the choices, and stores the matching string in an indexed memory.
    Example:
    - (Bach|Mozart) matches 'Bach' and 'Mozart'
  • (?:choice1|choice2|...) : option group, non-capturing
    Matches any of the choices, without storing the matching string in an indexed memory.
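
The difference between the group types shows up in what ends up in the memories. The Python sketch below uses a made-up string; note that Python spells a named group (?P<alias>...) whereas .NET, and therefore MusicBee, uses (?<alias>...).

```python
import re

text = "Mozart - Requiem"  # made-up tag content

# Two capturing groups: memory 1 and memory 2 are filled.
indexed = re.search(r"(Bach|Mozart) - (.+)", text)

# Non-capturing option group: only the second pair of brackets fills a memory,
# which therefore becomes memory 1.
non_capturing = re.search(r"(?:Bach|Mozart) - (.+)", text)

# Named groups (Python spelling of the syntax).
named = re.search(r"(?P<composer>Bach|Mozart) - (?P<work>.+)", text)

print(indexed.group(1), "|", indexed.group(2))   # Mozart | Requiem
print(non_capturing.group(1))                    # Requiem
print(named.group("composer"))                   # Mozart
```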



(3b) Lookarounds

'Lookarounds' are special groups used to define the context: what immediately precedes or follows.

Symbol          Lookaround type       Meaning
a(?=suffix)     positive lookahead    Matches 'a' followed by 'suffix',
                                      without including 'suffix' itself in the matching string.
a(?!suffix)     negative lookahead    Matches 'a' not followed by 'suffix',
                                      without including 'suffix' itself in the matching string.
(?<=prefix)a    positive lookbehind   Matches 'a' preceded by 'prefix',
                                      without including 'prefix' itself in the matching string.
(?<!prefix)a    negative lookbehind   Matches 'a' not preceded by 'prefix',
                                      without including 'prefix' itself in the matching string.
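
The four lookarounds can be tried with the Python sketch below. The sample strings are made up; the lookaround syntax is the same in Python's re module as in the table above, although Python is stricter than .NET about lookbehinds (they must have a fixed width).

```python
import re

# Positive lookahead: the title, only if " (Live)" follows; the suffix is not returned.
ahead = re.search(r"\w+(?= \(Live\))", "Yesterday (Live)").group(0)

# Negative lookahead: every 'a' not followed by 'b' is upper-cased.
not_ahead = re.sub(r"a(?!b)", "A", "ab ac ad")

# Positive lookbehind: digits preceded by "Disc ".
behind = re.findall(r"(?<=Disc )\d+", "Disc 2, Track 11")

# Negative lookbehind: whole numbers not preceded by "Disc ".
not_behind = re.findall(r"(?<!Disc )\b\d+", "Disc 2, Track 11")

print(ahead)       # Yesterday
print(not_ahead)   # ab Ac Ad
print(behind)      # ['2']
print(not_behind)  # ['11']
```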



(3c) Substitutions (in the replacement string)

A substitution is the use of an indexed or named memory (capturing group) in the replacement string of $RxReplace.

In the examples below, <MyTag> contains 'FooBar'.

Symbol      Meaning                            Example
$n          indexed group n                    $RxReplace(<MyTag>,"(.{3})(.{3})","$2,$1")
                                               -> 'Bar,Foo'
$`          the substring before the match     $RxReplace(<MyTag>,"a","$`")
                                               -> 'FooBFooBr'
$'          the substring after the match      $RxReplace(<MyTag>,"a","$'")
                                               -> 'FooBrr'
$+          the last indexed memory            $RxReplace(<MyTag>,"^(.{3})(.{3})(?:.*)$","$+")
                                               -> 'Bar'
${alias}    the content of stored memory       $RxReplace(<MyTag>,"(?<First>.{3})(?<Second>.{3})","${Second}")
            alias                              -> 'Bar'
            See: (?<alias>SubRegEx)
            in "(3) Groups"
$&          the matching string                $RxReplace(<MyTag>,"[aeiou]{2}","[$&]")
                                               -> 'F[oo]Bar'
$_          the whole input string             $RxReplace(<MyTag>,"[aeiou]{2}","[$_]")
                                               -> 'F[FooBar]Bar'
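
To make the meaning of each substitution concrete, the Python sketch below reproduces the results listed in the table. Python's re module has no $`, $', $& or $_ symbols, so they are emulated here with small replacement functions; this illustrates what the .NET symbols stand for and is not how MusicBee evaluates them.

```python
import re

tag = "FooBar"
I = re.IGNORECASE  # MusicBee's default

# .NET "$2,$1": indexed groups (Python writes \2 and \1)
swapped = re.sub(r"(.{3})(.{3})", r"\2,\1", tag, flags=I)

# .NET "$`": the substring before the match
before = re.sub(r"a", lambda m: m.string[:m.start()], tag, flags=I)

# .NET "$'": the substring after the match
after = re.sub(r"a", lambda m: m.string[m.end():], tag, flags=I)

# .NET "${Second}": named group (Python writes \g<Second>)
named = re.sub(r"(?P<First>.{3})(?P<Second>.{3})", r"\g<Second>", tag, flags=I)

# .NET "[$&]": the matching string
matched = re.sub(r"[aeiou]{2}", lambda m: "[" + m.group(0) + "]", tag, flags=I)

# .NET "[$_]": the whole input string
whole = re.sub(r"[aeiou]{2}", lambda m: "[" + m.string + "]", tag, flags=I)

print(swapped, before, after, named, matched, whole)
# Bar,Foo FooBFooBr FooBrr Bar F[oo]Bar F[FooBar]Bar
```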



(3d) Backreferences (in the RegEx)

A backreference is the use of an indexed or named memory (capturing group) in the very RegEx where the memory is defined, thus not in the replacement string.

Symbol       Meaning              Example
\n           indexed group n      "(.)\1" matches a character followed by its repetition, such as:
                                  aa, bb, cc, …

                                  "(.)(.)\2\1" matches two characters followed by their repetition in mirror, such as:
                                  ABBA, elle, otto, …
\k<alias>    named group alias    "(?<one>.)\k<one>" matches a character followed by its repetition, such as:
                                  aa, bb, cc, …

                                  "(?<one>.)(?<two>.)\k<two>\k<one>" matches two characters followed by their repetition in mirror, such as:
                                  ABBA, elle, otto, …
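
The indexed backreferences behave the same way in Python, so they can be tried with the sketch below (made-up word lists). The named form is left out because Python spells it differently from the .NET \k<alias> shown above.

```python
import re

# "(.)\1": a character followed by its repetition
doubled = [w for w in ["book", "bar", "aa"] if re.search(r"(.)\1", w)]

# "(.)(.)\2\1": two characters followed by their repetition in mirror
mirrored = [w for w in ["ABBA", "elle", "otto", "abab"]
            if re.search(r"(.)(.)\2\1", w)]

print(doubled)   # ['book', 'aa']
print(mirrored)  # ['ABBA', 'elle', 'otto']
```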



(4) Other symbols


(4a) Quantifiers

A quantifier is placed after a character or a group to indicate how many times it must be repeated (contiguously).

Symbol   Meaning                                  Example
+        one or more                              ba+r matches: bar, baar, baaar, …
                                                  but not: br
?        zero or one                              ba?r matches: br, bar
                                                  but not: baar
*        zero or more                             ba*r matches: br, bar, baar, …
{m,n}    between m and n times (both inclusive)   ba{2,3}r matches: baar, baaar
{m,}     at least m times                         ba{2,}r matches: baar, baaar, baaaar, …
{0,n}    at most n times                          ba{0,2}r matches: br, bar, baar
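
The table can be checked mechanically with a short Python sketch. The accepted helper and the candidate list are made up for this example; these quantifiers behave the same in Python's re module, though it is not the .NET engine MusicBee uses.

```python
import re

candidates = ["br", "bar", "baar", "baaar", "baaaar"]


def accepted(pattern: str) -> list:
    """Candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]


one_or_more = accepted(r"ba+r")       # ['bar', 'baar', 'baaar', 'baaaar']
zero_or_one = accepted(r"ba?r")       # ['br', 'bar']
zero_or_more = accepted(r"ba*r")      # all five candidates
two_to_three = accepted(r"ba{2,3}r")  # ['baar', 'baaar']
at_least_two = accepted(r"ba{2,}r")   # ['baar', 'baaar', 'baaaar']
at_most_two = accepted(r"ba{0,2}r")   # ['br', 'bar', 'baar']
```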

The examples above match repeated single characters.
Here is one for a repeated group:
"(the ){2}" matches a string containing "the the ".



(4b) Greedy and lazy quantifiers

By default, a constituent + quantifier matches the longest possible substring (= greedy behaviour). To make it match the shortest possible substring instead (= lazy behaviour), simply add ? after the quantifier.

Greedy   Lazy      Meaning
+        +?        one or more
?        ??        zero or one
*        *?        zero or more
{m,n}    {m,n}?    between m and n times (both inclusive)
{m,}     {m,}?     at least m times
{0,n}    {0,n}?    at most n times
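
The difference is easiest to see on a string with two bracketed parts. A minimal Python sketch with a made-up title (greedy and lazy quantifiers work the same way in Python's re module):

```python
import re

title = "Song (Live) (Remastered)"  # made-up tag content

# Greedy: ".+" grabs as much as possible, up to the LAST closing bracket.
greedy = re.search(r"\(.+\)", title).group(0)

# Lazy: ".+?" stops as early as possible, at the FIRST closing bracket.
lazy = re.search(r"\(.+?\)", title).group(0)

print(greedy)  # (Live) (Remastered)
print(lazy)    # (Live)
```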



(4c) Anchors

Anchors are special markers: instead of matching actual characters, they specify the position relative to the string boundaries or word boundaries.

Symbol   Meaning                Example
^        at start of string     "^A" matches "A" at the beginning of a string
$        at end of string       "bee$" matches "bee" at the end of a string
\b       on word boundary       "\bbee" matches "bee" at the start of a word, as in "beehive"
                                "bee\b" matches "bee" at the end of a word, as in "MusicBee"
\B       not on word boundary   "\Bbee\B" matches "bee" in "bumblebees"
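
The anchors can be tried with the Python sketch below. The sample text is made up, \w* is added to two patterns only to show the whole word that was found, and re.IGNORECASE stands in for MusicBee's default case-insensitivity.

```python
import re

I = re.IGNORECASE
text = "beehive MusicBee bumblebees"  # made-up sample

start_ok = re.search(r"^A", "ABBA") is not None      # True: "A" at the start
start_ko = re.search(r"^A", "DNA") is not None       # False: "A" is not at the start
end_ok = re.search(r"bee$", "MusicBee", I) is not None  # True: "bee" at the end

# Words starting with "bee" (boundary before it)
word_start = re.findall(r"\bbee\w*", text, I)

# Words ending with "bee" (boundary after it)
word_end = re.findall(r"\w*bee\b", text, I)

# "bee" strictly inside a word
inside = re.search(r"\Bbee\B", "bumblebees")

print(word_start)                        # ['beehive']
print(word_end)                          # ['MusicBee']
print(inside.group(0), inside.start())   # bee 6
```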



(4d) Case-sensitive matches

All match checks are case-insensitive by default in MusicBee (unlike other applications).
  • To make a RegEx part case-sensitive, prefix it with (?-i).
    Thus, placing (?-i) at the very beginning makes the whole RegEx case-sensitive.
  • To switch back to case-insensitive, add (?i) to the RegEx before the concerned part.
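
The effect of switching case sensitivity on and off can be tried in Python, with one caveat: as far as I know, Python's re module does not accept a bare (?-i) in the middle of a pattern, so this sketch uses the scoped form (?-i:...) instead (a form that .NET also documents). re.IGNORECASE stands in for MusicBee's default.

```python
import re

I = re.IGNORECASE  # MusicBee's default behaviour

# Default: "a" matches "A".
default_insensitive = re.search(r"a", "A", I) is not None

# Case-sensitive part: "(?-i:a)" no longer matches "A".
sensitive_part = re.search(r"(?-i:a)", "A", I) is not None

# First "a" case-sensitive, second "a" case-insensitive.
mixed_ok = re.fullmatch(r"(?-i:a)a", "aA", I) is not None
mixed_ko = re.fullmatch(r"(?-i:a)a", "AA", I) is not None

print(default_insensitive, sensitive_part, mixed_ok, mixed_ko)
# True False True False
```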

Symbol   Meaning                                   Example
(?-i)    case-sensitive:                           "(?-i)a" matches "a" but not "A"
         applies to the RegEx part to its right
(?i)     case-insensitive:                         "(?-i)a(?i)a" matches "aA" but not "AA"
         applies to the RegEx part to its right



(5) External resources


(5a) Exercises

Auto-corrected exercises. You can see the result of your RegEx as you type it.

(5b) Testing pages


MusicBee relies on .NET to interpret RegExes. RegEx interpreters other than .NET may give slightly different results, and so far I have found only one testing site that offers the .NET RegEx flavour.

For building, testing, and debugging your RegExes (match tests, replacements),
use this site:
  • https://RegEx101.com/
    - Select «.NET (C#)» as RegEx engine (FLAVOR pane on the left).
    - To the right of the REGULAR EXPRESSION, set regex options "g" (global) and "i" (case-insensitive).

Other RegEx flavours:

PCRE = Perl-Compatible Regular Expression


(5c) Unicode character tables

Reference tables in PDF format:

(5d) Supported Unicode character classes

Microsoft .NET reference:
Last Edit: October 09, 2023, 04:09:07 PM by karbock

hiccup

  • Sr. Member
  • ****
  • Posts: 7781
This is impressive and beautifully done. A great addition to Tips & Tricks!

I learned some new things, and am surely going to make good use of a favorites shortcut for it.
And I have some comments ;-) :

1.
For a(?!b) it says:
Matches 'a' followed by 'b', but does not match 'b' itself.
I think the 'not' was lost there, it should say 'not followed by...'

1b.
And maybe add the terms: lookahead and lookbehind to that part?
(maybe give them their own 'Lookarounds' header, instead of having them under 'Groups'?)
These are terminologies that people may do a search for.
(I would, and have done)

2.
It says:
"bee$" matches any string ending with "bee"
I'm wondering if a regex newbee might read that as if a whole string (tag) would get returned if it ends with bee.
Perhaps have it say something more specific like: matches 'bee' if the string ends with it.
(the same for: "^A" matches any string starting with "A")

3.
And while keeping my 'stupid hat' on:
There are explanations like these: "\Bbi\B" matches "carabiner"
It's a correct description for sure and how it is usually done for regex.
But to make it even more clear and literal for absolute regex newbies, perhaps do:
"\Bbi\B" matches 'bi' in 'carabiner'

4.
ma{2,3}n matches: maan, maaan
has a leftover 'n' behind {2,3} that shouldn't be there

5.
No regular expressions were harmed during the making of this post.
Well, I'm not sure about that ;-)
Under 'Case-sensitive matches' it says the formatting should be like this: {?-i}
I believe those should be round brackets, not curly ones.

6.
A question:
It's clearly explained what literals need to be escaped with a \ in a regex.
But what I sometimes run into is that I am not sure what literals need to be escaped when in a capturing group.
Not all of the above need to/should be escaped in that case, right?
Perhaps you know which ones do, and could list those too?

7.
Awesome work. I have thought about doing something like this also, but I wasn't up to the task and gave up on it, but I think you nailed it.
 
Last Edit: February 10, 2023, 04:39:37 PM by hiccup

karbock

This is impressive and beautifully done. A great addition to Tips & Tricks!
I learned some new things, and am surely going to make good use of a favorites shortcut for it.
(...)
7. Awesome work. I have thought about doing something like this also, but I wasn't up to the task and gave up on it, but I think you nailed it.
(...)
And I have some comments ;-) :
Hey lad  ;) , thanks for your kind words and your keen eye!

1. it should say 'not followed by...'
1b. And maybe add the terms: lookahead and lookbehind to that part?
(maybe give them their own 'Lookarounds' header, instead of having them under 'Groups'?)
Okee!

2. + 3.
There is (as often) more than one way to say it. I don't mind changing if you think it sounds clearer.

4. ma{2,3}n matches: maan, maaan
has a leftover 'n' behind {2,3} that shouldn't be there
The suffixed n is not too much, since it's present in maan and maaan. Here, "a{2,3}" is both prefixed and suffixed.
ma{2,3}n matches: maan, maaan
I can add some formatting, as in the line above, but you have already experienced how unreadable a BBcode text can get for its author...

5. I believe those should be round brackets, not curly ones.
Sure thing. And I feel relieved that the RegEx Welfare Society didn't sue me... :)

6. Not all of the above need to/should be escaped in that case, right?
A first draft, since the answer is not obvious, as you mentioned:
between "[...]", I don't escape special characters except: [  ]  -
But as for capturing groups in general, I'll check in my Perl bible and perform some tests.

Besides, I liked the 'newbee' spelling: it fits very nicely in the context.  ;)
Last Edit: February 10, 2023, 06:36:30 PM by karbock

hiccup

The suffixed n is not too much, since it's present in maan and maaan. Here, "a{2,3}" is both prefixed and suffixed.
Ah, my stupid hat must still have been on when I read that as a mistake, my brain registered the 'n' as representing a number. (and those are already between the brackets)

Quote
A first draft, since the answer is not obvious, as you mentioned:
between "[...]", I don't escape special characters except: [  ]  -
But as for capturing groups in general, I'll check in my Perl bible and perform some tests.
Yeah, it's not too obvious a matter. I seem to recall that either an opening or a closing square bracket does need escaping, but the other doesn't.
I'm sure you'll figure it out. TIA

Well done, kiddo!
(quote from a very old Dutch television commercial)
Last Edit: February 10, 2023, 06:43:25 PM by hiccup

karbock

Added:
* Backreferences

@hiccup:
Still looking for an answer to nr. 6...
"I'm not a number, I'm a free man!"
Last Edit: February 10, 2023, 08:59:06 PM by karbock

boroda

  • Sr. Member
  • ****
  • Posts: 4595
i guess that it's worth mentioning that one of the numbers can be omitted for a repeated group, e.g.: a{3,} means that "a" must be repeated at least 3 times, and a{,3} means that "a" must be repeated at most 3 times.

hiccup

"I'm not a number, I'm a free man!"
That's what you think ;-)

I never saw that series originally (didn't even know it existed), but a friend alerted me to it some two years ago, and we then viewed it in full.
It was really fun and interesting.

And a thought: perhaps add a link to a good online regex tester at the bottom of the start post?
(I am mostly using https://regex101.com/ but perhaps you prefer another one)
Last Edit: February 11, 2023, 07:31:57 AM by hiccup

hiccup

There is also a \u escape, which makes it possible to specify a UTF-16 Unicode character by its code.
I doubt it would be used often, but in some rare cases it might be useful.

The syntax is: \unnnn
So e.g. \u200B would match a zero-width space.

And I found there exists:
$_  Substitutes the entire input string.
But I am not sure if it could be of much use in the context of using regex with MusicBee.
So maybe it's not worth adding it, I don't know.
Last Edit: February 11, 2023, 08:37:55 AM by hiccup

boroda

I'm not sure that .NET regexes support $_. There are many variants of the regex specs.

hiccup

I'm not sure that .NET regexes support $_. There are many variants of the regex specs.
It works, I tested before posting this.

hiccup

Diving in deeper, I also discovered the so-called "Unicode general categories".
See here: https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-classes-in-regular-expressions#SupportedUnicodeGeneralCategories

In practice, most will not be of much use for MusicBee regexes, but there might be a couple that are:

P (all punctuation characters)
S (all symbols)
Z (all separator characters) at second look, this one is probably useless for MB use

e.g. \p{P} will match any punctuation character such as . , ' ‘ " “ - – : ;

What do you think?
You might object because it makes things more complicated than necessary.
But it does have the advantage that you don't need to worry about, e.g., occurrences of straight vs curly quotes.
Last Edit: February 11, 2023, 11:37:18 AM by hiccup

karbock

@boroda:

a{3,}: well spotted, I was just thinking: 'have I already mentioned it?'
a{,3}: not tested in MB so far, and not sure it's available in many regex parsers outside MB (doesn't exist in Perl), thus I'll propose a{0,3}, which is standard.
Just an example: https://stackoverflow.com/questions/22079519/why-isnt-the-x-at-most-n-times-regex-quantifier-included-in-java-in-the-x

@hiccup:

* https://regex101.com/: added, thanks!
* Unicode support: already considered when preparing the original post, I'll select the crème de la crème + external reference for those who wish extended documentation.
* $_: I had tested it, but didn't consider it (rightly or wrongly) as crucial in the first version. Added, example to come.
* An anecdote about Patrick McGoohan (who played the prisoner in the series): he was chosen only years later to play the warden in 'Escape from Alcatraz' (1979), based on true events.
Last Edit: February 11, 2023, 11:57:19 AM by karbock


hiccup

* Unicode support: already considered when preparing the original post, I'll select the crème de la crème + external reference for those who wish extended documentation.
I don't understand what you are saying here.
(I'll take a drink, maybe that helps)

karbock

I don't understand what you are saying here.
(I'll take a drink, maybe that helps)
You could use a cuppa. :D I've just had one (Yes, weekend!)
To rephrase my previous post: I'm going to prepare an addition about Unicode in RegExes. Stay tuned.