Author Topic: Regular Expressions: coding, examples, testing resources (Read 10573 times)

karbock · **on:** February 09, 2023, 07:37:58 PM

A Review of Regular Expressions

(1) Their use in MusicBee

(1a) Purposes

A regular expression (RegEx) is a pattern that is compared against the content of any tag
in order to perform:

match tests
substring extractions
replacements

(1b) Virtual tag functions using RegExes

Function and syntax	Purpose	IF <tag> matches RegEx, returns:	ELSE, returns:
`$IsMatch(<tag>,RegEx)`	checks if the RegEx is present anywhere in the tag	T	F
`$RxReplace(<tag>,RegEx,new)`	global string replacement	<tag> content with each occurrence of RegEx replaced by the new replacement string	the unmodified <tag> content
`$RxMatch(<tag>,RegEx)`	first matching portion	first portion of the tag matching the RegEx	a void string
`$RxSplit(<tag>,RegEx,n)`	nth delimited section	nth section of the <tag> content split up by using RegEx as delimiter	the unmodified <tag> content

Notes for the functions presented above:

use double quotes around the RegEx string or new replacement string if they include special characters such as comma, round bracket, angle bracket, single quote, i.e.:
, ( ) < > '
the match test is case-insensitive by default ("x" matches both x and X)
-> See "(4d) Case-sensitive matches";
the tag itself is not modified in the audio track nor in MusicBee's library.
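
To make the behaviour in the table above more concrete, here is a minimal Python sketch of the four functions. It is only an approximation: Python's `re` module stands in for the .NET engine, the helper names (`is_match`, `rx_replace`, `rx_match`, `rx_split`) are hypothetical, and the 1-based section index as well as the empty result for an out-of-range index in `rx_split` are assumptions that the table does not state.

```python
import re

# Hypothetical helpers that mimic the table above; this is not MusicBee's code.
FLAGS = re.IGNORECASE  # matching is case-insensitive by default in MusicBee

def is_match(tag, regex):
    """Return 'T' if the RegEx is found anywhere in the tag, else 'F'."""
    return "T" if re.search(regex, tag, FLAGS) else "F"

def rx_replace(tag, regex, new):
    """Replace every occurrence; without a match the tag comes back unchanged."""
    return re.sub(regex, new, tag, flags=FLAGS)

def rx_match(tag, regex):
    """Return the first matching portion, or an empty string."""
    m = re.search(regex, tag, FLAGS)
    return m.group(0) if m else ""

def rx_split(tag, regex, n):
    """Return the nth section (assumed to count from 1) of the split tag."""
    parts = re.split(regex, tag, flags=FLAGS)
    if len(parts) == 1:          # delimiter not found: unmodified tag content
        return tag
    # Assumption: an out-of-range index yields an empty string
    return parts[n - 1] if 1 <= n <= len(parts) else ""
```

For instance, `rx_split("Rock; Pop; Jazz", "; ", 2)` gives 'Pop' under these assumptions. Note that the replacement string of `rx_replace` follows Python's syntax here (`\1` instead of `$1`).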

(1c) Overview of RegEx symbols

RegEx symbols include: constituents, quantifiers, anchors and grouping marks.

Each RegEx constituent can be
- a single character: literal, or belonging to a set/range, or
- a group: several characters between round brackets (parentheses).
Any constituent is optionally followed by a quantifier, indicating how many times it must be present.
Anchors can be added to the RegEx to specify positions or boundaries.
You can specify whether the match must be case-sensitive, and for which RegEx portion.

(2) Characters

(2a) Literal characters and custom character classes

Symbol	Meaning
`a`	literal character 'a'. All characters match themselves literally, except those having a special meaning in a RegEx: `. + * ? ^ $ ( ) [ ] { } | \`. See also "(2c) Literal characters, backslashed".
`[aeiou]`	any character in the set
`[^aeiou]`	any character not in the set
`[a-z]`	any character in the range
`[^a-z]`	any character not in the range
`.`	any character, except new line
`[.]`	character '.'
`\uFFFF`	Unicode (UTF-16) character with hexadecimal code 'FFFF'

(2b) Pre-defined character classes (Unicode)

Backslash + lowercase letter = positive class.
Backslash + UPPERCASE letter = negative class (any character not in the positive class).

Symbol	Meaning
`\w`	a word character: letter, digit, or connector punctuation such as the underscore (the hyphen is not a word character)
`\W`	a non-word character
`\d`	a digit
`\D`	a non-digit character
`\s`	a white space character (including: space, tab, vertical tab, linefeed)
`\S`	a non-space character
`\p{P}`	any punctuation character, such as: `. , ' ‘ " “ - – : ;`
`\P{P}`	any non-punctuation character
`\p{S}`	any symbol character (currency, math, ...)
`\P{S}`	any non-symbol character
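
A small sketch of these classes in Python, offered as an approximation only. Python's `re` module knows `\w`, `\d` and `\s` but has no `\p{P}`, so the punctuation test below is a stand-in built on `unicodedata` (the helper `is_punctuation` is hypothetical). In Python's engine, `\w` covers the underscore but not the hyphen.

```python
import re
import unicodedata

text = "rock-n_roll 99"

# \w+ stops at the hyphen and at the space, but not at the underscore
words = re.findall(r"\w+", text)     # ['rock', 'n_roll', '99']
digits = re.findall(r"\d", text)     # ['9', '9']
spaces = re.findall(r"\s", text)     # [' ']

def is_punctuation(ch):
    """Stand-in for \\p{P}: Unicode general categories starting with 'P'."""
    return unicodedata.category(ch).startswith("P")

# Straight and curly quotes are both punctuation
punct = [ch for ch in "It's \u201clive\u201d, no?" if is_punctuation(ch)]
```

The last line is the point made for `\p{P}`: straight and curly quotes are caught by the same class, so you need not list every variant yourself.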

(2c) Literal characters, backslashed

To match a character that has special meaning in regular expressions, precede it by a backslash.

Symbol	Meaning
`\. \[ \]`	The character after the backslash: `. [ ]` Meaning without \: see "(2a) Literal characters and custom character classes".
`\+ \* \? \{ \}`	The character after the backslash: `+ * ? { }` Meaning without \: see "(4a) Quantifiers".
`\( \) \|`	The character after the backslash: `( ) |` Meaning without \: see "(3) Groups".
`\^ \$`	The character after the backslash: `^ $` Meaning without \: see "(4c) Anchors".
`\\`	a backslash
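
The difference between a special character and its backslashed form can be checked with a short Python sketch (Python's `re` as a stand-in for .NET; the sample strings are made up). `re.escape` is a Python convenience with no counterpart mentioned in this post.

```python
import re

# Unescaped '.' is a wildcard; '\.' only matches a real dot
wildcard = bool(re.fullmatch(r"a.b", "axb"))       # True
literal = bool(re.fullmatch(r"a\.b", "axb"))       # False
literal_dot = bool(re.fullmatch(r"a\.b", "a.b"))   # True

# Backslashed square brackets and plus sign match themselves
found = re.findall(r"\[\w+\]|\+", "[Live] + [Demo]")

# Python can add the backslashes for you (Python-only helper)
suffix = "(feat. X)"
escaped_hit = bool(re.search(re.escape(suffix), "Song (feat. X)"))
```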

(2d) Combining characters within a RegEx

You can place single characters side-by-side as you wish.

Symbol	Meaning
`foobar`	literal string 'foobar'
`[a-z][0-9]`	any letter, immediately followed by any digit => matches a3, b0, c9, …
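
A quick way to try out sets, ranges and side-by-side characters is a few lines of Python (an approximation of the .NET behaviour; the sample strings are invented for the purpose):

```python
import re

# A letter immediately followed by a digit
pairs = re.findall(r"[a-z][0-9]", "a3 b0 zz c9 7d")

# Negated set: every character that is not a vowel
non_vowels = re.findall(r"[^aeiou]", "bee")

# '.' matches any character except a new line; '[.]' matches a real dot
any_char = re.findall(r"b.e", "bee bye b\ne")
dots = re.findall(r"[.]", "v1.2.3")

# \uFFFF-style escape: U+00E9 is 'é'
has_e_acute = bool(re.search(r"\u00e9", "Beyonc\u00e9"))
```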

(3) Groups

A group is a series of contiguous single characters, defined by enclosing the string in round brackets (parentheses).
A group serves to:

store the result of the matching group in an indexed memory (1, 2, 3, …)
store the result of the matching group in a named memory
define the scope of the next quantifier (see "(4a) Quantifiers" below)
specify different options

The contents of the indexed/named memories can then be used in the replacement string (with $RxReplace).
See "(3c) Substitutions" below.

(3a) Group types

Symbol	Group type	Meaning
`(SubRegEx)`	indexed group, capturing	Treats SubRegEx as a group and places the matching string in an indexed memory. Each group of the whole RegEx is assigned an incremented number, starting from 1. See "(3c) Substitutions" below.
`(?:SubRegEx)`	non-capturing group	Treats SubRegEx as a group, without placing the matching string in an indexed memory. Useful if you simply want to apply a quantifier to the group.
`(?<alias>SubRegEx)`	named group, capturing	The matching string is stored in a named memory instead of an indexed one. See "(3c) Substitutions" below.
`(choice1|choice2|...)`	option group, capturing	Matches any of the choices, and stores the matching string in an indexed memory. Example: (Bach|Mozart) matches 'Bach' and 'Mozart'
`(?:choice1|choice2|...)`	option group, non-capturing	Matches any of the choices, without storing the matching string in an indexed memory.
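
The group types can be compared side by side in a short Python sketch. Treat it as an illustration rather than a MusicBee recipe: Python's `re` stands in for .NET, the tag value is made up, and Python spells a named group `(?P<alias>...)` where .NET uses `(?<alias>...)`.

```python
import re

title = "Mozart - Requiem"   # hypothetical tag content

# Capturing option group: the chosen alternative is stored in memory 1
m = re.search(r"(Bach|Mozart) - (.+)", title)
composer, work = m.group(1), m.group(2)

# Non-capturing option group: nothing is stored, so the work becomes memory 1
m2 = re.search(r"(?:Bach|Mozart) - (.+)", title)
only_work = m2.group(1)

# Named groups (Python syntax: ?P<alias>)
m3 = re.search(r"(?P<composer>\w+) - (?P<work>.+)", title)
named = m3.group("composer")
```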

(3b) Lookarounds

'Lookarounds' are special groups used to define the context: what immediately precedes or follows.

Symbol	Lookaround type	Meaning
`a(?=suffix)`	positive lookahead	Matches 'a' followed by 'suffix', without including 'suffix' itself in the matching string.
`a(?!suffix)`	negative lookahead	Matches 'a' not followed by 'suffix', without including 'suffix' itself in the matching string.
`(?<=prefix)a`	positive lookbehind	Matches 'a' preceded by 'prefix', without including 'prefix' itself in the matching string.
`(?<!prefix)a`	negative lookbehind	Matches 'a' not preceded by 'prefix', without including 'prefix' itself in the matching string.
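
Here is a hedged Python sketch of the four lookarounds, with invented sample strings. One known difference: Python only accepts lookbehinds of fixed length, whereas .NET is more permissive, so keep the prefix simple when testing this way.

```python
import re

# Positive lookahead: the word is matched, " (live)" is only checked
ahead = re.findall(r"\w+(?= \(live\))", "Angie (live)")

# Negative lookahead: a 'q' that is not followed by 'u'
not_ahead = re.findall(r"q(?!u)", "Iraq quit")

# Positive lookbehind: a digit preceded by 'CD'
behind = re.findall(r"(?<=CD)\d", "CD1 CD2 LP3")

# Negative lookbehind: a digit not preceded by 'CD'
not_behind = re.findall(r"(?<!CD)\d", "CD1 CD2 LP3")
```

In each case the context (suffix or prefix) is tested but left out of the matching string.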

(3c) Substitutions (in the replacement string)

A substitution is the use of an indexed or named memory (capturing group) in the replacement string of $RxReplace.

In the examples below, <MyTag> contains 'FooBar'.

Symbol	Meaning	Example
`$n`	indexed group n	$RxReplace(<MyTag>,"(.{3})(.{3})","$2,$1") -> 'Bar,Foo'
`` $` ``	the substring before the match	$RxReplace(<MyTag>,"a","$`") -> 'FooBFooBr'
`$'`	the substring after the match	$RxReplace(<MyTag>,"a","$'") -> 'FooBrr'
`$+`	the last indexed memory	$RxReplace(<MyTag>,"^(.{3})(.{3})(?:.*)$","$+") -> 'Bar'
`${alias}`	the content of stored memory alias See: (?<alias>SubRegEx) in "(3) Groups"	$RxReplace(<MyTag>,"(?<First>.{3})(?<Second>.{3})","${Second}") -> 'Bar'
`$&`	the matching string	$RxReplace(<MyTag>,"[aeiou]{2}","[$&]") -> 'F[oo]Bar'
`$_`	the whole input string	$RxReplace(<MyTag>,"[aeiou]{2}","[$_]") -> 'F[FooBar]Bar'
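
The 'FooBar' examples above can be reproduced in Python, with the caveat that Python's replacement syntax differs from .NET's: `\n` replaces `$n`, `\g<alias>` replaces `${alias}`, `\g<0>` replaces `$&`, and the remaining substitutions have no direct equivalent, so this sketch emulates them with a replacement function. The mapping is my own approximation, not something MusicBee exposes.

```python
import re

tag = "FooBar"
I = re.IGNORECASE  # MusicBee's default

# $n  ->  \n
swapped = re.sub(r"(.{3})(.{3})", r"\2,\1", tag, flags=I)
# $`  ->  the text before the match
before = re.sub("a", lambda m: m.string[:m.start()], tag, flags=I)
# $'  ->  the text after the match
after = re.sub("a", lambda m: m.string[m.end():], tag, flags=I)
# $+  ->  the last group that took part in the match
last = re.sub(r"^(.{3})(.{3})(?:.*)$", lambda m: m.group(m.lastindex), tag, flags=I)
# ${alias}  ->  \g<alias>  (named groups are written ?P<alias> in Python)
named = re.sub(r"(?P<First>.{3})(?P<Second>.{3})", r"\g<Second>", tag, flags=I)
# $&  ->  \g<0>
whole_match = re.sub(r"[aeiou]{2}", r"[\g<0>]", tag, flags=I)
# $_  ->  the whole input string
whole_input = re.sub(r"[aeiou]{2}", lambda m: "[" + m.string + "]", tag, flags=I)
```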

(3d) Backreferences (in the RegEx)

A backreference is the use of an indexed or named memory (capturing group) in the very RegEx where the memory is defined, thus not in the replacement string.

Symbol	Meaning	Example
`\n`	indexed group n	"(.)\1" matches a character followed by its repetition, such as: aa, bb, cc, …; "(.)(.)\2\1" matches two characters followed by their repetition in mirror, such as: ABBA, elle, otto, …
`\k<alias>`	named group alias	"(?<one>.)\k<one>" matches a character followed by its repetition, such as: aa, bb, cc, …; "(?<one>.)(?<two>.)\k<two>\k<one>" matches two characters followed by their repetition in mirror, such as: ABBA, elle, otto, …
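
The mirror example can be tried out in Python as a rough check. Indexed backreferences are written the same way; named ones are spelled `(?P=alias)` in Python instead of `\k<alias>`. The word list is made up.

```python
import re

# Two characters followed by their repetition in mirror
mirror = re.compile(r"(.)(.)\2\1", re.IGNORECASE)
candidates = ["ABBA", "elle", "otto", "abab"]
mirrored = [w for w in candidates if mirror.fullmatch(w)]

# Named backreference (Python syntax): a character followed by its repetition
doubled = re.search(r"(?P<one>.)(?P=one)", "book").group(0)
```

'abab' is rejected because its third character would have to repeat the second one.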

(4) Other symbols

(4a) Quantifiers

A quantifier is placed after a character or a group to indicate how many times it must be repeated (contiguously).

Symbol	Meaning	Example
`+`	one or more	ba+r matches: bar, baar, baaar, … but not: br
`?`	zero or one	ba?r matches: br, bar but not: baar
`*`	zero or more	ba*r matches: br, bar, baar, …
`{m,n}`	between m and n times (both inclusive)	ba{2,3}r matches: baar, baaar
`{m,}`	at least m times	ba{2,}r matches: baar, baaar, baaaar, …
`{0,n}`	at most n times	ba{0,2}r matches: br, bar, baar
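
The table can be verified with a few lines of Python (again an approximation of the .NET behaviour). The helper `matching` is hypothetical; it keeps the words that a pattern matches in full.

```python
import re

words = ["br", "bar", "baar", "baaar", "baaaar"]

def matching(pattern):
    """Return the words matched entirely by the pattern."""
    return [w for w in words if re.fullmatch(pattern, w)]

one_or_more = matching(r"ba+r")      # bar, baar, baaar, baaaar
zero_or_one = matching(r"ba?r")      # br, bar
zero_or_more = matching(r"ba*r")     # all five words
two_to_three = matching(r"ba{2,3}r")  # baar, baaar
at_least_two = matching(r"ba{2,}r")  # baar, baaar, baaaar
at_most_two = matching(r"ba{0,2}r")  # br, bar, baar
```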

The examples above match repeated single characters.
Here is one for a repeated group:
"(the ){2}" matches a string containing "the the ".

(4b) Greedy and lazy quantifiers

By default, a constituent + quantifier matches the longest possible substring (= greedy behaviour). To make it match the shortest possible substring instead (= lazy behaviour), simply add ? after the quantifier.

Greedy	Lazy	Meaning
`+`	`+?`	one or more
`?`	`??`	zero or one
`*`	`*?`	zero or more
`{m,n}`	`{m,n}?`	between m and n times (both inclusive)
`{m,}`	`{m,}?`	at least m times
`{0,n}`	`{0,n}?`	at most n times
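
The difference only shows when more than one match length is possible. A minimal Python sketch, using an invented title with two bracketed suffixes:

```python
import re

title = "Song (live) (remix)"   # hypothetical tag content

# Greedy: runs from the first '(' to the last ')'
greedy = re.search(r"\(.+\)", title).group(0)

# Lazy: stops at the first ')' that allows a match
lazy = re.search(r"\(.+?\)", title).group(0)
```

Here `greedy` is '(live) (remix)' while `lazy` is '(live)'.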

(4c) Anchors

Anchors are special markers: instead of matching actual characters, they specify the position relative to the string boundaries or word boundaries.

Symbol	Meaning	Example
`^`	at start of string	"^A" matches "A" at the beginning of a string
`$`	at end of string	"bee$" matches "bee" at the end of a string
`\b`	on word boundary	"\bbee" matches "bee" at the start of a word, as in "beehive"; "bee\b" matches "bee" at the end of a word, as in "MusicBee"
`\B`	not on word boundary	"\Bbee\B" matches "bee" in "bumblebees"
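
Since anchors match positions rather than characters, it can help to look at where the match lands. A small Python sketch (an approximation of the .NET behaviour, with case-insensitivity switched on as in MusicBee):

```python
import re

I = re.IGNORECASE

# ^ and $ : only the anchored text is matched, not the whole string
start = re.search(r"^A", "ABBA", I).span()          # (0, 1)
no_start = re.search(r"^A", "Santana", I)           # None: no 'A' at the start
end = re.search(r"bee$", "MusicBee", I).span()      # (5, 8)

# \b : 'bee' at the beginning of a word
word_start = re.search(r"\bbee", "a beehive", I).span()    # (2, 5)

# \B : 'bee' strictly inside a word
inside = re.search(r"\Bbee\B", "bumblebees", I).span()     # (6, 9)
not_inside = re.search(r"\Bbee\B", "bee", I)               # None
```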

(4d) Case-sensitive matches

All match checks are case-insensitive by default in MusicBee (unlike other applications).

To make a RegEx part case-sensitive, prefix it with (?-i).
Thus, placing (?-i) at the very beginning makes the whole RegEx case-sensitive.
To switch back to case-insensitive matching, add (?i) to the RegEx before the part concerned.

Symbol	Meaning	Example
`(?-i)`	case-sensitive: applies to the RegEx part to its right	"(?-i)a" matches "a" but not "A"
`(?i)`	case-insensitive: applies to the RegEx part to its right	"(?-i)a(?i)a" matches "aA" but not "AA"
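
For those who test outside MusicBee: Python's `re` does not accept a bare (?-i) in the middle of a pattern, so the sketch below uses the scoped form `(?-i:...)`, which limits case-sensitivity to the bracketed part. It reproduces the second example of the table under that assumption, with the IGNORECASE flag imitating MusicBee's default.

```python
import re

# First 'a' is case-sensitive, second 'a' follows the default (case-insensitive)
pattern = re.compile(r"(?-i:a)a", re.IGNORECASE)

results = {s: bool(pattern.fullmatch(s)) for s in ["aA", "aa", "AA", "Aa"]}
```

Only 'aA' and 'aa' pass, since the first letter must be a lowercase 'a'.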

(5) External resources

(5a) Exercises

Auto-corrected exercises. You can see the result of your RegEx as you type it.

http://RegExtutorials.com/excercise.html

(5b) Testing pages

MusicBee relies on .NET to interpret RegExes. RegEx interpreters other than .NET may give slightly different results, and so far I have found only one testing site that offers the .NET RegEx flavour.

For building, testing, and debugging your RegExes (match tests, replacements),
use this site:

https://RegEx101.com/
- Select «.NET (C#)» as RegEx engine (FLAVOR pane on the left).
- To the right of the REGULAR EXPRESSION, set regex options "g" (global) and "i" (case-insensitive).

Other RegEx flavours:

https://RegExr.com/: PCRE, JavaScript
https://www.regextester.com/: PCRE, JavaScript
https://www.freeformatter.com/regex-tester.html: JavaScript

PCRE = Perl-Compatible Regular Expression

(5c) Unicode character tables

Reference tables in PDF format:

https://www.unicode.org/charts/

(5d) Supported Unicode character classes

Microsoft .NET reference:

hiccup · **Reply #1 on:** February 10, 2023, 04:09:21 PM

This is impressive and beautifully done. A great addition to Tips & Tricks!

I learned some new things, and am surely going to make good use of a favorites shortcut for it.
And I have some comments ;-) :

1.
For a(?!b) it says:
Matches 'a' followed by 'b', but does not match 'b' itself.
I think the 'not' was lost there, it should say 'not followed by...'

1b.
And maybe add the terms: lookahead and lookbehind to that part?
(maybe give them their own 'Lookarounds' header, instead of having them under 'Groups'?)
These are terminologies that people may do a search for.
(I would, and have done)

2.
It says:
"bee$" matches any string ending with "bee"
I'm wondering if a regex newbee might read that as if a whole string (tag) would get returned if it ends with bee.
Perhaps have it say something more specific like: matches 'bee' if the string ends with it.
(the same for: "^A" matches any string starting with "A")

3.
And while keeping my 'stupid hat' on:
There are explanations like these: "\Bbi\B" matches "carabiner"
It's a correct description for sure and how it is usually done for regex.
But to make it even more clear and literal for absolute regex newbies, perhaps do:
"\Bbi\B" matches 'bi' in 'carabiner'

4.
ma{2,3}n matches: maan, maaan
has a leftover 'n' behind {2,3} that shouldn't be there

5.

Quote from: karbock on February 09, 2023, 07:37:58 PM

No regular expressions were harmed during the making of this post.

Well, I'm not sure about that ;-)
Under 'Case-sensitive matches' it says the formatting should be like this: {?-i}
I believe those should be round brackets, not curly ones.

6.
A question:
It's clearly explained what literals need to be escaped with a \ in a regex.
But what I sometimes run into is that I am not sure what literals need to be escaped when in a capturing group.
Not all of the above need to/should be escaped in that case, right?
Perhaps you know which ones do, and could list those too?

7.
Awesome work. I have thought about doing something like this also, but I wasn't up to the task and gave up on it, but I think you nailed it.

karbock · **Reply #2 on:** February 10, 2023, 06:26:15 PM

Quote from: hiccup on February 10, 2023, 04:09:21 PM

This is impressive and beautifully done. A great addition to Tips & Tricks!
I learned some new things, and am surely going to make good use of a favorites shortcut for it.
(...)
7. Awesome work. I have thought about doing something like this also, but I wasn't up to the task and gave up on it, but I think you nailed it.
(...)
And I have some comments ;-) :

A jong, thanks for your kind words and your keen eye!

Quote from: hiccup on February 10, 2023, 04:09:21 PM

1. it should say 'not followed by...'
1b. And maybe add the terms: lookahead and lookbehind to that part?
(maybe give them their own 'Lookarounds' header, instead of having them under 'Groups'?)

Okay!

Quote from: hiccup on February 10, 2023, 04:09:21 PM

2. + 3.

There is (as often) more than one way to say it. I don't mind changing if you think it sounds clearer.

Quote from: hiccup on February 10, 2023, 04:09:21 PM

4. ma{2,3}n matches: maan, maaan
has a leftover 'n' behind {2,3} that shouldn't be there

The suffixed n is not too much, since it's present in maan and maaan. Here, "a{2,3}" is both prefixed and suffixed.
ma{2,3}n matches: maan, maaan
I can add some formatting, as in the line above, but you have already experienced how unreadable a BBcode text can get for its author...

Quote from: karbock on February 09, 2023, 07:37:58 PM

5. I believe those should be round brackets, not curly ones.

Sure thing. And I feel relieved that the RegEx Welfare Society didn't sue me...

Quote from: karbock on February 09, 2023, 07:37:58 PM

6. Not all of the above need to/should be escaped in that case, right?

A first draft, since the answer is not obvious, as you mentioned:
between "[...]", I don't escape special characters except: [ ] -
But as for capturing groups in general, I'll check in my Perl bible and perform some tests.

Besides, I liked the 'newbee' spelling: it fits very nicely in the context.

hiccup · **Reply #3 on:** February 10, 2023, 06:39:04 PM

Quote from: karbock on February 10, 2023, 06:26:15 PM

The suffixed n is not too much, since it's present in maan and maaan. Here, "a{2,3}" is both prefixed and suffixed.

Ah, my stupid hat must still have been on when I read that as a mistake, my brain registered the 'n' as representing a number. (and those are already between the brackets)

Quote

A first draft, since the answer is not obvious, as you mentioned:
between "[...]", I don't escape special characters except: [ ] -
But as for capturing groups in general, I'll check in my Perl bible and perform some tests.

Yeah, it's not too obvious a matter. I seem to recall that either an opening or a closing square bracket does need escaping, but the other doesn't.
I'm sure you'll figure it out. TIA

Well done, kiddo!
(quote from a very old Dutch TV commercial)

karbock · **Reply #4 on:** February 10, 2023, 08:55:31 PM

Added:
* Backreferences

@hiccup:
Still looking for an answer to nr. 6...
"I'm not a number, I'm a free man!"

boroda · **Reply #5 on:** February 10, 2023, 09:42:59 PM

i guess that it's worth mentioning that one of the numbers can be omitted for a repeated group, e.g.: a{3,} means that "a" must be repeated at least 3 times, and a{,3} means that "a" must be repeated at most 3 times.

hiccup · **Reply #6 on:** February 11, 2023, 07:07:42 AM

Quote from: karbock on February 10, 2023, 08:55:31 PM

"I'm not a number, I'm a free man!"

That's what you think ;-)

I never saw that series originally (didn't even know it existed), but a friend alerted me to it some two years ago, and we then viewed it in full.
It was really fun and interesting.

And a thought: perhaps add a link to a good online regex tester at the bottom of the start post?
(I am mostly using https://regex101.com/ but perhaps you prefer another one)

hiccup · **Reply #7 on:** February 11, 2023, 08:33:31 AM

There also exists a \u switch, which makes it possible to define a UTF16 unicode character.
I doubt if it would be used often, but in some rare cases it might be useful?

The syntax is: \unnnn
So e.g. \u200B would match a zero-width space.

And I found there exists:
$_ Substitutes the entire input string.
But I am not sure if it could be of much use in the context of using regex with MusicBee.
So maybe it's not worth adding it, I don't know.

boroda · **Reply #8 on:** February 11, 2023, 08:56:01 AM

i'm not sure that .net regexes support $_. there are many derivations of regex specs.

hiccup · **Reply #9 on:** February 11, 2023, 08:58:42 AM

Quote from: boroda on February 11, 2023, 08:56:01 AM

i'm not sure that .net regexes support $_. there are many derivations of regex specs.

It works, I tested before posting this.

hiccup · **Reply #10 on:** February 11, 2023, 10:49:03 AM

Diving in deeper, I also discovered so called "Unicode general categories".
See here: https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-classes-in-regular-expressions#SupportedUnicodeGeneralCategories

In practice, most will not be of much use for MusicBee regexes, but there might be a couple that are:

P (all punctuation characters)
S (all symbols)
~~Z (all separator characters)~~ at second look, this one is probably useless for MB use

e.g. \p{P} will match any punctuation character such as . , ' ‘ " “- – : ;

What do you think?
You might object because it makes things more complicated than maybe necessary.
But it does have the advantage that you e.g. don't need to worry about occurrences of straight vs curly quotes.

karbock · **Reply #11 on:** February 11, 2023, 11:49:26 AM

@boroda:

a{3,}: well spotted, I was just thinking: 'have I already mentioned it?'
a{,3}: not tested in MB so far, and not sure it's available in many regex parsers outside MB (doesn't exist in Perl), thus I'll propose a{0,3}, which is standard.
Just an example: https://stackoverflow.com/questions/22079519/why-isnt-the-x-at-most-n-times-regex-quantifier-included-in-java-in-the-x

@hiccup:

* https://regex101.com/: added, thanks!
* Unicode support: already considered when preparing the original post, I'll select the crème de la crème + external reference for those who wish extended documentation.
* $_: I had tested it, but didn't consider it (rightly or wrongly) as crucial in the first version. Added, example to come.
* An anecdote about Patrick McGoohan (who played the prisoner in the series): he was chosen only years later to play the warden in 'Escape from Alcatraz' (1979), based on true events.

karbock · **Reply #12 on:** February 11, 2023, 01:47:15 PM

Added:
* $_ with example
* $& with example

hiccup · **Reply #13 on:** February 11, 2023, 02:29:14 PM

Quote from: karbock on February 11, 2023, 11:49:26 AM

* Unicode support: already considered when preparing the original post, I'll select the crème de la crème + external reference for those who wish extended documentation.

I don't understand what you are saying here.
(I'll take a drink, maybe that helps)

karbock · **Reply #14 on:** February 11, 2023, 02:42:51 PM

Quote from: hiccup on February 11, 2023, 02:29:14 PM

I don't understand what you are saying here.
(I'll take a drink, maybe that helps)

You could use a cuppa.

I've just had one (Yes, weekend!)
To rephrase my previous post: I'm going to prepare an addition about Unicode in RegExes. Stay tuned.
