DSL Syntax

Some examples are written as DSL expression → resulting regex.
All examples assume:

import fr.splayce.rel._
import Implicits._
val a = RE("aa")
val b = RE("bb")

Operators

Binary operators

Operation                    REL Syntax  RE Output
Alternative                  a | b       aa|bb
Concatenation (protected)    a ~ b       (?:aa)(?:bb)
Concatenation (unprotected)  a - b       aabb

Generally speaking, you should start with protected concatenation. It is harder to read once serialized, but it is far safer from unwanted side-effects when reusing regex parts.
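
To illustrate why protection matters, here is a minimal sketch that uses only the Scala standard library, not REL itself. The parts aa|bb and cc, the object name ConcatDemo and the helper full are assumptions made for the example; the two serializations are built by hand to mimic the outputs listed in the table above.

```scala
object ConcatDemo {
  // Hypothetical regex parts, kept as plain strings (this does not call REL).
  val left  = "aa|bb" // a part that contains an alternative
  val right = "cc"

  // Unprotected: plain juxtaposition, like the serialized output of `-`.
  val unprotected = left + right // aa|bbcc

  // Protected: each side wrapped in a non-capturing group, like `~`.
  val guarded = s"(?:$left)(?:$right)" // (?:aa|bb)(?:cc)

  // True when the whole input matches the pattern.
  def full(pattern: String, input: String): Boolean = input.matches(pattern)

  def main(args: Array[String]): Unit = {
    println(full(unprotected, "aacc")) // false: parsed as aa, or bbcc
    println(full(unprotected, "aa"))   // true: an unwanted side effect
    println(full(guarded, "aacc"))     // true
    println(full(guarded, "aa"))       // false
  }
}
```

In the unprotected form the alternative leaks across the concatenation boundary, which is the kind of side effect the protected form avoids.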

Quantifiers / repeaters

Where the table below uses it, the dotted syntax (a.?) is recommended because it makes operator precedence clearer.

Quantifier  Greedy                           Reluctant / Lazy    Possessive           Output (greedy)
Option      a.?                              a.??                a.?+                 (?:aa)?
≥ 1         a.+                              a.+?                a.++                 (?:aa)+
≥ 0         a.*                              a.*?                a.*+                 (?:aa)*
At most     a < 3                            a.<?(3) *           a <+ 3               (?:aa){0,3}
At least    a > 3                            a >? 3              a >+ 3               (?:aa){3,}
In range    a(1, 3), a{1 to 3} or a{1 -> 3}  a(1, 3, Reluctant)  a(1, 3, Possessive)  (?:aa){1,3}
Exactly     a{3} or a(3)                     N/A                 N/A                  (?:aa){3}

* For the reluctant at-most repeater, the dotted form a.<?(3) is mandatory, because a standalone <? is syntactically significant in Scala (XMLSTART).
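
The difference between the three quantifier modes is easiest to see on the serialized regexes. The sketch below runs hand-written patterns through the Scala standard library only and does not call REL; the object name QuantifierDemo, the input string and the helper first are assumptions made for the example.

```scala
object QuantifierDemo {
  val input = "aaaaaa"

  // First match of `pattern` in `input`, if any.
  def first(pattern: String): Option[String] = pattern.r.findFirstIn(input)

  val greedy    = first("(?:aa)+")     // takes as many repetitions as possible
  val reluctant = first("(?:aa)+?")    // takes as few repetitions as possible
  val atMost    = first("(?:aa){0,2}") // stops after two repetitions

  // A greedy repeater gives characters back so the trailing aa can match.
  val backtracks = first("(?:aa)+aa")
  // A possessive repeater never gives characters back, so the trailing aa fails.
  val possessive = first("(?:aa)++aa")

  def main(args: Array[String]): Unit = {
    println(greedy)     // Some(aaaaaa)
    println(reluctant)  // Some(aa)
    println(atMost)     // Some(aaaa)
    println(backtracks) // Some(aaaaaa)
    println(possessive) // None
  }
}
```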

Look-around

Type                  Prefixed form  Dotted form  Output
Look-ahead            ?=(a)          a.?=         (?=aa)
Look-behind           ?<=(a)         a.?<=        (?<=aa)
Negative look-ahead   ?!(a)          a.?!         (?!aa)
Negative look-behind  ?<!(a)         a.?<!        (?<!aa)
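
As a rough illustration of what the four outputs do once serialized, the following sketch applies them next to a literal bb or aa. It uses only the Scala standard library, not REL; the object name LookAroundDemo, the input string and the helper starts are assumptions made for the example.

```scala
object LookAroundDemo {
  // Offsets: a=0, a=1, b=2, b=3, space=4, b=5, b=6
  val input = "aabb bb"

  // All start offsets at which `pattern` matches in `input`.
  def starts(pattern: String): List[Int] =
    pattern.r.findAllMatchIn(input).map(_.start).toList

  val afterAa     = starts("(?<=aa)bb") // bb preceded by aa
  val notAfterAa  = starts("(?<!aa)bb") // bb not preceded by aa
  val beforeBb    = starts("aa(?=bb)")  // aa followed by bb
  val notBeforeBb = starts("aa(?!bb)")  // aa not followed by bb

  // Look-arounds are zero-width: the bb is checked but not consumed.
  val matchedText = "aa(?=bb)".r.findFirstIn(input)

  def main(args: Array[String]): Unit = {
    println(afterAa)     // List(2)
    println(notAfterAa)  // List(5)
    println(beforeBb)    // List(0)
    println(notBeforeBb) // List()
    println(matchedText) // Some(aa)
  }
}
```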

Grouping

Type                       REL Syntax                  Output
Named capturing            a \ "group_a"               (aa)
Unnamed capturing *        a.g                         (aa)
Back-reference             g!                          \1 **
Non-capturing              a.ncg or a.%                (?:aa)
Non-capturing, with flags  a.ncg("i-d") or "i-d" ?: a  (?i-d:aa)
Atomic                     a.ag, ?>(a) or a.?>         (?>aa)

* A unique group name is generated internally.
** Back-reference to the most recent (i.e. rightmost previous) group g. Example: val g = (a|b).g; g - a - !g → (aa|bb)aa\1
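
To show what the back-reference output means in practice, here is a small sketch that feeds the serialized form (aa|bb)aa\1 to the Scala standard library. It does not call REL; the object name BackRefDemo, the helper full and the test inputs are assumptions made for the example.

```scala
object BackRefDemo {
  // Serialized form of the back-reference example, written as a raw string.
  val pattern = """(aa|bb)aa\1"""

  // True when the whole input matches the pattern.
  def full(input: String): Boolean = input.matches(pattern)

  val sameAa = full("aaaaaa") // group captured aa, so \1 must be aa
  val sameBb = full("bbaabb") // group captured bb, so \1 must be bb
  val mixed  = full("aaaabb") // group captured aa, but the tail is bb

  def main(args: Array[String]): Unit = {
    println(sameAa) // true
    println(sameBb) // true
    println(mixed)  // false
  }
}
```

The back-reference repeats the text that the group actually captured, not the group's pattern, which is why the mixed input is rejected.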

In a named capturing group, the name group_a will be passed to the Regex constructor, and queryable on corresponding Matches. If you export the regex to a flavor that supports inline embedding of capturing group names (like Java 7 or .NET), the name will be included in the output: (?<group_a>aa).

In non-capturing groups, REL tries not to uselessly wrap non-breaking entities, such as single characters (a, \u00F0), character classes (\w, [^a-z], \p{Lu}) and other groups, in order to produce ever-so-slightly less unreadable output. Non-capturing groups with flags are combined when nested, giving priority to innermost flags: a.ncg("-d").ncg("id") → (?i-d:aa).
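
For readers less familiar with inline flags, the sketch below checks how java.util.regex (through the Scala standard library) treats a flagged non-capturing group. It does not call REL. The object name FlagsDemo is an assumption, and the nested pattern is a hand-written illustration of innermost-flag priority in the regex engine itself, not an output produced by REL.

```scala
object FlagsDemo {
  // Flag i turns on case-insensitive matching inside the group only.
  val flagged = "AA".matches("(?i-d:aa)")
  val plain   = "AA".matches("(?:aa)")

  // Hand-written nesting: the inner -i wins over the outer i for the first a.
  val nested = "(?i:(?-i:a)a)"
  val innerLower = "aA".matches(nested) // inner a matches a, outer a matches A
  val innerUpper = "Aa".matches(nested) // inner a is case-sensitive, A fails

  def main(args: Array[String]): Unit = {
    println(flagged)    // true
    println(plain)      // false
    println(innerLower) // true
    println(innerUpper) // false
  }
}
```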

Constants

A few “constants” (expression terms with no repetitions, capturing groups, or unprotected alternatives) are also predefined. Some of them have a UTF-8 Greek symbol alias for conciseness (import rel.Symbols._ to use them), with the uppercase symbol denoting negation. You can add your own by instantiating the case class RECst(expr).

Object name      Symbol  Output / Matches
Epsilon          ε       Empty string
Dot              τ       .
MLDot            ττ      [\s\S] (will match any char, including line terminators, even when the DOTALL or MULTILINE modes are disabled)
LineTerminator   Τ *     (?:\r\n?|[\u000A-\u000C\u0085\u2028\u2029]) (line terminators, PCRE/Perl’s \R)
AlphaLower       none    [a-z]
AlphaUpper       none    [A-Z]
Alpha            α       [a-zA-Z]
NotAlpha         Α *     [^a-zA-Z]
Letter           λ       \p{L} (Unicode letters, including those with diacritics)
NotLetter        Λ       \P{L}
LetterLower      none    \p{Ll}
LetterUpper      none    \p{Lu}
Digit            δ       \d
NotDigit         Δ       \D
WhiteSpace       σ       \s
NotWhiteSpace    Σ       \S
Word             μ       \w (Alpha, Digit or _)
NotWord          Μ *     \W
WordBoundary     ß       \b
NotWordBoundary  Β *     \B
LineBegin        ^       ^
LineEnd          $       $
InputBegin       ^^      \A
InputEnd         $$      \z

* These are uppercase α/ß/μ/τ, not Latin A/B/M/T.
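
A few of these outputs behave differently from what their names might suggest at first glance. The sketch below tries some of them with the Scala standard library only, without calling REL; the object name ConstantsDemo and the sample inputs are assumptions made for the example.

```scala
object ConstantsDemo {
  val text = "ab\ncd"

  // Dot versus MLDot: without DOTALL, the dot stops at the line terminator.
  val dotRun   = ".+".r.findFirstIn(text)
  val mlDotRun = """[\s\S]+""".r.findFirstIn(text)

  // LineBegin (^) under the MULTILINE flag matches at the start of every line,
  // while InputBegin (\A) only matches at the very start of the input.
  val lineStarts  = """(?m)^\w""".r.findAllIn(text).toList
  val inputStarts = """(?m)\A\w""".r.findAllIn(text).toList

  // Word (\w) covers letters, digits and the underscore.
  val wordRun = """\w+""".r.findFirstIn("x_1-y")

  def main(args: Array[String]): Unit = {
    println(dotRun)      // Some(ab)
    println(mlDotRun == Some(text)) // true
    println(lineStarts)  // List(a, c)
    println(inputStarts) // List(a)
    println(wordRun)     // Some(x_1)
  }
}
```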