REL

A Regular Expression composition Library

REL is a Scala library for people dealing with complex, modular regular expressions. It defines a DSL with most of the operators you already know and love. This allows you to isolate portions of your regex for easier testing and reuse.

Consider the following YYYY-MM-DD date regex: ^(?:19|20)\d\d([- /.])(?:0[1-9]|1[012])\1(?:0[1-9]|[12]\d|3[01])$
It is a bit more readable and reusable expressed like this:

import fr.splayce.rel._
import Implicits._

val sep     = "[- /.]" \ "sep"            // group named "sep"
val year    = ("19" | "20") ~ """\d\d"""  // ~ is concatenation
val month   = "0[1-9]" | "1[012]"
val day     = "0[1-9]" | "[12]\\d" | "3[01]"
val dateYMD = ^ ~ year  ~ sep ~ month ~ !sep ~ day  ~ $
val dateMDY = ^ ~ month ~ sep ~ day   ~ !sep ~ year ~ $

These values are RE objects (also named terms or trees/subtrees), which can be converted to scala.util.matching.Regex instances either implicitly (by importing rel.Implicits._) or explicitly (via the .r method).

The embedded Date regexes and extractors will give you more complete examples, matching several date formats at once with little prior knowledge.
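
As a point of comparison, here is a minimal sketch that exercises the hand-written YYYY-MM-DD regex from the introduction using only scala.util.matching.Regex (no REL involved). The object and method names are illustrative assumptions. It mainly shows the role of the back-reference: both separators must be the same character.

```scala
import scala.util.matching.Regex

object DateRegexDemo {
  // The hand-written YYYY-MM-DD regex quoted in the introduction, unchanged.
  val dateYMD: Regex =
    """^(?:19|20)\d\d([- /.])(?:0[1-9]|1[012])\1(?:0[1-9]|[12]\d|3[01])$""".r

  // True when the whole input is a date accepted by the regex.
  def isDate(s: String): Boolean = dateYMD.findFirstIn(s).isDefined

  // The separator captured by group 1, if the input matched.
  def separator(s: String): Option[String] =
    dateYMD.findFirstMatchIn(s).map(_.group(1))
}
```

With this sketch, "2012-01-23" is accepted while "2012-01/23" is rejected, because \1 must repeat the first separator.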

Features

A familiar, regex-like syntax
Powerful extractors for scala Pattern Matching
Bundled matchers for frequently-used utilities like dates
Tree-rewriting utilities and flavors to use your regexes in other languages
Bundled cleaners to clean your input and further simplify your regexes

Usage and downloads

Download the source from GitHub and build the library with SBT
Download the latest binary release
Use our public Maven repository
Check out the API reference

License

REL is released under the Creative Commons BY-NC-SA 4.0 International License.

Authors

REL was developed at Imaginatio for project Splayce by:

Adrien Lavoillotte (@streetpc)
Julien Martin

Contributors:

Guillaume Vauvert (@gvauvert) designed the TrackString algorithm

DSL Syntax

Some examples are noted DSL expression → resulting regex.
All assume:

import fr.splayce.rel._
import Implicits._
val a = RE("aa")
val b = RE("bb")

Operators

Binary operators

Operation	REL Syntax	RE Output
Alternative	`a \| b`	`aa\|bb`
Concatenation (protected)	`a ~ b`	`(?:aa)(?:bb)`
Concatenation (unprotected)	`a - b`	`aabb`

Generally speaking, you should start with protected concatenation. It is harder to read once serialized, but it is far safer from unwanted side-effects when reusing regex parts.
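
The risk with unprotected concatenation can be reproduced with plain strings. The sketch below does not use REL: it concatenates an alternative with a suffix by hand, once as-is and once wrapped in non-capturing groups, using only the standard library. All names are illustrative.

```scala
import scala.util.matching.Regex

object ConcatDemo {
  val alternative = "aa|bb"
  val suffix      = "cc"

  // Unprotected: the suffix only binds to the last branch -> aa|bbcc
  val unprotectedRe: Regex = (alternative + suffix).r

  // Protected: each part is wrapped first -> (?:aa|bb)(?:cc)
  val protectedRe: Regex = s"(?:$alternative)(?:$suffix)".r

  // True when the regex matches the entire input.
  def fullMatch(r: Regex, s: String): Boolean = r.pattern.matcher(s).matches()
}
```

The unprotected form accepts "aa" on its own and rejects "aacc", which is rarely what was intended; the protected form behaves the other way round.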

Quantifiers / repeaters

When used in the table below, the dot syntax a.? is recommended for clearer priority.

Quantifier	Greedy	Reluctant / Lazy	Possessive	Output (for greedy)
Option	`a.?`	`a.??`	`a.?+`	`(?:aa)?`
≥ 1	`a.+`	`a.+?`	`a.++`	`(?:aa)+`
≥ 0	`a.*`	`a.*?`	`a.*+`	`(?:aa)*`
At most	`a < 3`	`a.<?(3)`*	`a <+ 3`	`(?:aa){0,3}`
At least	`a > 3`	`a >? 3`	`a >+ 3`	`(?:aa){3,}`
In range	`a(1, 3)`, `a{1 to 3}` or `a{1 -> 3}`	`a(1, 3, Reluctant)`	`a(1, 3, Possessive)`	`(?:aa){1,3}`
Exactly	`a{3}` or `a(3)`	N/A	N/A	`(?:aa){3}`

* For reluctant at-most repeater, dotted form a.<?(3) is mandatory, standalone <? being syntactically significant in Scala (XMLSTART).
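
To make the three quantifier modes concrete, the following sketch runs the serialized forms directly with scala.util.matching.Regex. It is independent of REL, and the helper name is an assumption.

```scala
object QuantifierDemo {
  // First substring of `input` matched by `regex`, if any.
  def first(regex: String, input: String): Option[String] =
    regex.r.findFirstIn(input)

  // Greedy at-most: takes as many repetitions as allowed.
  val greedyAtMost = "(?:aa){0,3}"
  // Reluctant at-most: takes as few repetitions as possible.
  val lazyAtMost = "(?:aa){0,3}?"
  // Greedy star can give back one "aa" so that the trailing "aa" matches.
  val greedyThenTail = "(?:aa)*aa"
  // Possessive star never gives anything back, so the trailing "aa" fails.
  val possessiveThenTail = "(?:aa)*+aa"
}
```

On the input "aaaaaaaa", the greedy form matches six characters while the reluctant form matches the empty string. On "aaaa", the greedy variant with a tail matches and the possessive one does not.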

Look-around

	Prefixed form	Dotted form	Output
Look-ahead	`?=(a)`	`a.?=`	`(?=aa)`
Look-behind	`?<=(a)`	`a.?<=`	`(?<=aa)`
Negative look-ahead	`?!(a)`	`a.?!`	`(?!aa)`
Negative look-behind	`?<!(a)`	`a.?<!`	`(?<!aa)`

Grouping

Type	REL Syntax	Output
Named capturing	`a \ "group_a"`	`(aa)`
Unnamed capturing *	`a.g`	`(aa)`
Back-reference	`!g`	`\1`**
Non-capturing	`a.ncg` or `a.%`	`(?:aa)`
Non-capturing, with flags	`a.ncg("i-d")` or `"i-d" ?: a`	`(?i-d:aa)`
Atomic	`a.ag`, `?>(a)` or `a.?>`	`(?>aa)`

* A unique group name is generated internally.
** Back-reference on most recent (i.e. rightmost previous) group g. val g = (a|b).g; g - a - !g → (aa|bb)aa\1

In a named capturing group, the name group_a will be passed to the Regex constructor, and queryable on corresponding Matches. If you export the regex to a flavor that supports inline embedding of capturing group names (like Java 7 or .NET), the name will be included in the output: (?<group_a>aa).

In non-capturing groups, REL tries not to uselessly wrap non-breaking entities — like single characters (a, \u00F0), character classes (\w, [^a-z], \p{Lu}), other groups — in order to produce ever-so-slightly less unreadable output. Non-capturing groups with flags are combined when nested, giving priority to innermost flags: a.ncg("-d").ncg("id") → (?i-d:aa).
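
The two serialized forms discussed above, inline group names and non-capturing groups with flags, can be tried directly with java.util.regex on Java 7 or later. This sketch does not use REL and its names are illustrative. Note that Java's own inline group names accept only ASCII letters and digits, so the sketch uses groupA rather than group_a.

```scala
import java.util.regex.Pattern

object GroupingDemo {
  // Inline named capturing group (Java 7+ syntax).
  private val named = Pattern.compile("(?<groupA>aa)bb")
  // Case-insensitivity applies inside the group only.
  private val scoped = Pattern.compile("(?i:aa)bb")

  // Content of the named group when the whole input matches.
  def namedCapture(s: String): Option[String] = {
    val m = named.matcher(s)
    if (m.matches()) Some(m.group("groupA")) else None
  }

  // True when the whole input matches the flag-scoped pattern.
  def flagScoped(s: String): Boolean = scoped.matcher(s).matches()
}
```

Here "AAbb" is accepted by the flag-scoped pattern but "AABB" is not, since the i flag does not leak out of its group.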

Constants

A few “constants” (expression terms with no repetitions, capturing groups, or unprotected alternatives) are also predefined. Some of them have a UTF-8 Greek symbol alias for conciseness (import rel.Symbols._ to use them), uppercase for negation. You can add your own by instantiating the case class RECst(expr).

Object name	Symbol	Output / Matches
`Epsilon`	`ε`	Empty string
`Dot`	`τ`	`.`
`MLDot`	`ττ`	`[\s\S]` (will match any char, including line terminators, even when the `DOTALL` or `MULTILINE` modes are disabled)
`LineTerminator`	`Τ`*	`(?:\r\n?\|[\u000A-\u000C\u0085\u2028\u2029])` (line terminators, PCRE/Perl’s `\R`)
`AlphaLower`	none	`[a-z]`
`AlphaUpper`	none	`[A-Z]`
`Alpha`	`α`	`[a-zA-Z]`
`NotAlpha`	`Α`*	`[^a-zA-Z]`
`Letter`	`λ`	`\p{L}` (unicode letters, including diacritics)
`NotLetter`	`Λ`	`\P{L}`
`LetterLower`	none	`\p{Ll}`
`LetterUpper`	none	`\p{Lu}`
`Digit`	`δ`	`\d`
`NotDigit`	`Δ`	`\D`
`WhiteSpace`	`σ`	`\s`
`NotWhiteSpace`	`Σ`	`\S`
`Word`	`μ`	`\w` (`Alpha`, `Digit` or `_`)
`NotWord`	`Μ`*	`\W`
`WordBoundary`	`ß`	`\b`
`NotWordBoundary`	`Β`*	`\B`
`LineBegin`	`^`	`^`
`LineEnd`	`$`	`$`
`InputBegin`	`^^`	`\A`
`InputEnd`	`$$`	`\z`

* Those are uppercase α/ß/μ/τ, not latin A/B/M/T

Extractors

Extractors are meant to help you extract information from text using the regexes you made, especially the text matched by the capturing groups of your regex.

Reminder on capturing groups

Capturing groups will yield String objects that may be empty (if the group matched an empty part) or null (if the group wasn’t matched). For instance, matching the string "A" against the regex "(A)(B?)(C)?" will yield values "A", "", null.

For example, say we want to extract information from the captured matches of the following regex:
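
This behavior can be checked with the standard library alone. The sketch below (illustrative names, no REL) returns the raw group values and also wraps them in Option, which turns the null of an unmatched group into None while keeping the empty string of a group that matched nothing.

```scala
object CapturedGroupsDemo {
  private val abc = "(A)(B?)(C)?".r

  // Raw group values of the first match; unmatched groups are null.
  def raw(s: String): Option[List[String]] =
    abc.findFirstMatchIn(s).map(_.subgroups)

  // Same values wrapped in Option: null becomes None, "" stays Some("").
  def asOptions(s: String): Option[List[Option[String]]] =
    raw(s).map(_.map(Option(_)))
}
```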

val abc = ("." \ "a") - (".".? \ "b") - ("." \ "c").?

An empty string won’t match and strings longer than 3 characters will match multiple times. Thus, for each match, the possible results are:

String	`a` (#1)	`b` (#2)	`c` (#3)
`"A"`	`"A"`	`""`	null
`"AB"`	`"A"`	`"B"`	null
`"ABC"`	`"A"`	`"B"`	`"C"`

Also, Java < 7 does not support named capturing groups, and Scala only emulates them, mapping a list of names given at the compilation of the Regex against the indexes of the capturing groups. Thus, it is risky to have multiple instances of the same group name. In practice, using myMatch.group("myGroup") seems to always refer to the last occurrence of myGroup.

On the other hand, the Match object carries the full list of group names (in its eponymous groupNames val), and REL uses it to compute the group tree. Thus, you can reuse the same group name in a single expression.

The Extractor trait

The Extractor trait is mainly a function that takes in a String and gives an Iterator of the parametrized type, with utility methods for composing and pattern matching.

This Extractor trait works with a sub-extractor, which can be of two types:

A PartialFunction[Regex.Match, A], which is pattern matching-friendly
A Function[Regex.Match, Option[A]], which can allow a bit more flexibility and/or performance

RE expressions offer a utility << method to which you can pass sub-extractors of either type, getting an Extractor[A] that you can apply to Strings to perform extraction.

Basic extractors

Some trivial sub-extractors are provided for convenience:

The simplest extractor is MatchedExtractor, which only yields every match as a String.
NthGroupExtractor yields the content matched by the nth capturing group, with n defaulting to 1.
NamedGroupExtractor does the same with the group holding the specified name.

Example:

val extractABC = abc << MatchedExtractor()
extractABC("1234567890").toList === List("123", "456", "789", "0")
val extractB = abc << NthGroupExtractor(2)
extractB("1234567890").toList === List("2", "5", "8", "")
val extractC = abc << NamedGroupExtractor("c")
extractC("1234567890").toList === List("3", "6", "9", null)

MatchGroups for quick, flat Pattern Matching

If you want to do pattern matching on the list of strings matched by the capturing groups, you can use:

MatchGroups(val1, val2, …) where valn are matched Strings that may be null or empty.
NotNull(opt1, Some(val2), …) where optn are Option[String]: Some(valn) if nth group matched (even if empty), None otherwise.
NotNull.NamedMap(map) where map will be a Map[String, Option[String]] with group names as keys.
NotNull.NamedPairs(pair1, (name2, opt2), …) where each pair is a Tuple2[String, Option[String]] with the group name and optional value.
NotEmpty, NotEmpty.NamedMap and NotEmpty.NamedPairs if you don’t care for empty matches: options will only be Some(value) if value is not an empty string.

Extractor examples:

import fr.splayce.rel.util.MatchGroups._
val pf: MatchExtractor[String] = {
  case MatchGroups("A", "", null)                 => "'A' only"
  case NotNull(Some("1"), Some(""), None)         => "'1' only"
  case NotNull.NamedMap(m) if (m contains "d")    => "unreachable"
  case NotNull.NamedPairs(_, ("b", Some("B")), _) => "b has 'B'"
  case NotEmpty(Some("x"), None, None)            => "'x' only"
  case NotEmpty.NamedMap(m) if (m contains "d")   => "unreachable"
  case NotEmpty.NamedPairs(_, ("b", Some(b)), _)  => "b has: " + b
}
val extract = re << pf
// extract(someString)

Reusable extractors: MatchGroup hierarchies

One of the incentives for using REL is to reuse regex parts in other regexes. So we also need a way to reuse the corresponding extractors, including nesting them in other extractors.

Since a REL term is a tree, it can compute the resulting capturing groups tree with the matchGroup val, containing a tree of MatchGroups. The top group corresponds to the entire match: it is unnamed, contains the matched content and has the first-level capturing groups nested as subgroups. When applied to a Match, it returns a copy of the capturing groups tree with the content filled for each group that matched. Thus, you can use pattern matching with nested groups to extract any group at several levels of nesting with little code.

For example, let’s say we want to match simple usernames that have the form user@machine where both parts have only alphabetic characters. We can define the regex:

val user     = α.+ \ "user"
val at       = "@"
val machine  = α.+ \ "machine"
val username = (user - at - machine) \ "username"

And make a simple extractor that yields a tuple of Strings:

val userMatcher: PartialFunction[MatchGroup, (String, String)] = {
  case MatchGroup(None, Some(_), List(              // $0
      MatchGroup(Some("username"), Some(_), List(   // $1
        MatchGroup(Some("user"),    Some(u), Nil),  // $2
        MatchGroup(Some("machine"), Some(m), Nil)   // $3
      ))
    )) => (u, m)
}

Extraction in a String can be done like this:

import ByOptionExtractor._   // lift (and toPM later)
val userExtractor = username << lift(userMatcher)
val users = userExtractor("me@dev, you@dev")  // Iterator
users.toList.toString === "List((me,dev), (you,dev))"

Note that you don’t need lift if you use a Function[MatchGroup, Option[A]] instead of a PartialFunction[MatchGroup, A].

Since REL supports multiple capturing groups with the same name, we can extract items formatted with username->username:

val interaction = username - "->" - username
val iaMatcher: PartialFunction[MatchGroup, (String, String)] = {
  case MatchGroup(None, Some(_), List(
      MatchGroup(Some("username"), Some(un1), _),
      MatchGroup(Some("username"), Some(un2), _)
    )) => (un1, un2)
}
val iaExtractor = interaction << lift(iaMatcher)
val interactions =
  iaExtractor("me@dev->you@dev, you@dev->me@dev")
interactions.toList.toString ===
  "List((me@dev,you@dev), (you@dev,me@dev))"

And then you make a reusable extractor, which can directly provide the extracted object. Just place the extractor one level deeper to avoid the $0 group:

val userMatcher2: PartialFunction[MatchGroup, (String, String)] = {
  case MatchGroup(Some("username"), Some(_), List(
      MatchGroup(Some("user"),    Some(u), Nil),
      MatchGroup(Some("machine"), Some(m), Nil)
    )) => (u, m)
}
val userPattern = toPM(lift(userMatcher2))
val iaMatcher2: PartialFunction[MatchGroup,
       (String, String, String, String)] = {
  case MatchGroup(None, Some(_), List(
      userPattern(u1, m1),
      userPattern(u2, m2)
    )) => (u1, m1, u2, m2)
}
val iaExtractor2 = interaction << lift(iaMatcher2)
val interactions2 =
  iaExtractor2("me@dev->you@dev, you@dev->me@dev")
interactions2.toList.toString ===
  "List((me,dev,you,dev), (you,dev,me,dev))"

In the same way, there are date extractors bundled in REL that can extract dates from strings, each match giving a list of possible date interpretations (to account for ambiguity). See the doc on Matchers for more details.

The following example demonstrates the use of pattern matching directly on a String:

val nfDateX = fr.splayce.rel.matchers.DateExtractor.NUMERIC_FULL
"From 21/10/2000 to 21/11/2000" match {
  case nfDateX(List(a), List(b)) => (a.m, b.m) === (10, 11)
}

Debugging

Finally, the toString representation of a MatchGroup can be really helpful when debugging an Extractor or a regex.

scala> val nfd = fr.splayce.rel.matchers.Date.NUMERIC_FULL
scala> nfd.matchGroup
res1: fr.splayce.rel.util.MatchGroup = 
None	None
	Some(n_f)	None
		Some(n_ymd)	None
			Some(n_sep)	None
			Some(n_sep)	None
		Some(n_dmy)	None
			Some(n_sep)	None
			Some(n_sep)	None

The top group has no name (first column is None), for it represents the whole match. We can see the sub-hierarchy of named groups, but it has no content yet. To fill the content, it must be applied to a Match:

scala> nfd.matchGroup(nfd.r.findFirstMatchIn("1998-10-20").get)
res2: fr.splayce.rel.util.MatchGroup = 
None	Some(1998-10-20)
	Some(n_f)	Some(1998-10-20)
		Some(n_ymd)	Some(1998-10-20)
			Some(n_sep)	Some(-)
			Some(n_sep)	None
		Some(n_dmy)	None
			Some(n_sep)	None
			Some(n_sep)	None

Then we can see which groups matched which part.

Matchers

REL comes with some matchers built-in for commonly needed entities like dates. Matchers are RE expressions and corresponding extractors.

As the matchers collection will probably grow, they may come in a separate packaging one day (e.g. rel-contrib), but should keep their package and class names.

Dates

The date matchers provide regexes and extractors for dates, both numeric (1/23/12, 2012-01-23) and alphanumeric (january 23rd, 2012), for English and French, partial (with at least a month or a year) or full.

Regexes

Regex in `rel.matchers.…`	Matches
`Date.NUMERIC_FULL`	`YMD` and `DMY` numerical formats
`Date.NUMERIC_FULL_US`	`MDY`, `YMD` and `DMY` numerical formats
`Date.NUMERIC`	Numerical, including partial dates
`Date.NUMERIC_US`	Numerical, including `MDY` and partials
`en.Date.ALPHA`	English alphanumerical dates or partials
`en.Date.ALPHA_FULL`	English alphanumerical dates
`en.Date.ALL`	English alphanumerical or numerical dates or partials
`en.Date.ALL_FULL`	English alphanumerical or numerical dates
`fr.Date.ALPHA`	French alphanumerical dates or partials
`fr.Date.ALPHA_FULL`	French alphanumerical dates
`fr.Date.ALL`	French alphanumerical or numerical dates or partials
`fr.Date.ALL_FULL`	French alphanumerical or numerical dates

Extractors

Numerical dates may be ambiguous. For this reason, the date extractors will extract, for each match in the input string, a List[DateExtractor.Result]. Any additional disambiguation is left to client code.

DateExtractor.Result is a case class with three Option[Int] fields: y, m and d. A year may have 2 or 4 digits, which is left for interpretation too. The toString method provides search-engine-friendly tokens: 1998-10-20 will yield Y1998 Y98 M10 D20 (note that the year is also repeated in its 2-digit form).

The extractors to use are:

matchers.DateExtractor for numeric (full or partial, English or French)
matchers.en.FullDateExtractor for matchers.en.*_FULL
matchers.en.DateExtractor for matchers.en.*
matchers.fr.FullDateExtractor for matchers.fr.*_FULL
matchers.fr.DateExtractor for matchers.fr.*

To extract dates from a String:

import fr.splayce.rel.matchers
import matchers.DateExtractor.NUMERIC
import matchers.fr.{FullDateExtractor => FrDateExtr}

NUMERIC("2012-12-31").next // List(Y2012 Y12 M12 D31)
FrDateExtr("31 janvier 2013").next // List(Y2013 Y13 M01 D31)

We can also reuse these extractors in other extractors to get Lists of DateExtractor.Result directly from a String, Regex.Match, MatchGroup… DateExtractorSpec contains several examples and use cases.

Utilities

The matchers package also provides a few utility functions to help build other matchers. The escaped and unescaped functions build a RE expression to match an escaped (resp. unescaped) sub-expression, i.e. preceded by an odd (resp. even) number of the escape string.
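
The parity idea can be illustrated with a plain java.util.regex sketch: a character is conventionally considered escaped when an odd number of backslashes precedes it, and unescaped when that number is even (zero included). The code below is not REL's implementation and does not reproduce its function signatures; the names and the choice of the backslash as escape string are assumptions.

```scala
import java.util.regex.Pattern
import scala.util.matching.Regex

object EscapeDemo {
  // A run of backslash pairs that is not itself preceded by a backslash.
  // The possessive *+ prevents the engine from giving back a pair.
  private val evenBackslashes = """(?<!\\)(?:\\\\)*+"""

  // `target` preceded by an even number of backslashes: unescaped.
  def evenBackslashesThen(target: String): Regex =
    (evenBackslashes + Pattern.quote(target)).r

  // `target` preceded by an odd number of backslashes: escaped.
  def oddBackslashesThen(target: String): Regex =
    (evenBackslashes + """\\""" + Pattern.quote(target)).r
}
```

For instance, with "$" as the target, the input a\$b contains an escaped dollar sign while a\\$b contains an unescaped one (preceded by an escaped backslash).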

Tree rewriting & Flavors

This chapter shows you how to recursively rewrite a REL expression, and how to use Flavors to express your regex in flavors/languages other than Scala/Java.

Subtree rewriting

An advantage to having a manipulable expression tree, other than reusing components, is that you can transform them as you please.

REL offers a way to do such manipulation quite simply using Scala’s powerful pattern matching. By passing a Rewriter to a RE object’s map method, you can recursively rewrite this object’s subtree. A Rewriter is actually a PartialFunction[RE, RE].

For example, we have a regex matching and capturing a UUID in its canonical form (lowercase hexadecimal, 8-4-4-4-12 digits). It is then used in a more complex expression as a capturing group.

val s = RE("-")
val h = RE("[0-9a-f]")
val uuid = h{8} - s - h{4} - s - h{4} - s - h{4} - s - h{12}
val complexExpression = /* … */ a ~ (uuid \ "uuid1") ~
    b ~ (uuid \ "uuid2") ~ c /* … */

Say we want to match a complexExpression elsewhere, without capturing the uuid. We can just transform our capturing "uuid" groups into non-capturing groups:

val toOther: Rewriter = {
  case Group(_, uuid, _) => uuid.ncg
}
val other = complexExpression map toOther

Now, say we also want uppercase hexadecimal throughout this expression, h being used in other places than uuid. We can complete our Rewriter:

val H = RE("[0-9A-F]")
val toOther: Rewriter = {
  case `h` => H
  case Group(_, uuid, _) => uuid.ncg
}
val other = complexExpression map toOther
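
The general idea of rewriting a tree with a partial function can be sketched without REL, on a tiny stand-in expression type. Everything below (the Node hierarchy, the local Rewriter alias, the rewrite function and its children-first traversal order) is a hypothetical illustration, not REL's actual classes or traversal strategy.

```scala
object RewriteDemo {
  // Minimal stand-in for an expression tree.
  sealed trait Node { def render: String }
  final case class Atom(s: String) extends Node { def render: String = s }
  final case class Capture(child: Node) extends Node {
    def render: String = "(" + child.render + ")"
  }
  final case class NonCapture(child: Node) extends Node {
    def render: String = "(?:" + child.render + ")"
  }
  final case class Concat(l: Node, r: Node) extends Node {
    def render: String = l.render + r.render
  }

  type Rewriter = PartialFunction[Node, Node]

  // Rewrites children first, then offers the rebuilt node to the rewriter.
  // Nodes the rewriter is not defined for are kept unchanged.
  def rewrite(n: Node, rw: Rewriter): Node = {
    val rebuilt = n match {
      case Capture(c)    => Capture(rewrite(c, rw))
      case NonCapture(c) => NonCapture(rewrite(c, rw))
      case Concat(l, r)  => Concat(rewrite(l, rw), rewrite(r, rw))
      case a: Atom       => a
    }
    rw.applyOrElse(rebuilt, (x: Node) => x)
  }
}
```

A rewriter such as { case Capture(c) => NonCapture(c); case Atom("[0-9a-f]") => Atom("[0-9A-F]") } applied to Concat(Capture(Atom("[0-9a-f]")), Atom("-")) then renders as (?:[0-9A-F])-.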

Flavors

Other languages and tools have other regex flavors, with (sometimes subtle) differences in implementation and additional or lacking features (with respect to Java’s regex flavor). If we want to use our regexes in other flavors, we can apply some transformation to obtain compatible regexes (up to a point, the limit being unimplemented, unreplicable features).

For some differences, a simple transformation will suffice. For instance, .NET’s regex flavor considers that \w should match all letters, including accented letters. Thus, DotNETFlavor will transform every \w (when expressed with μ/Word) into [a-zA-Z0-9_] to avoid unwanted surprises.
For some lacking features, an exact equivalent exists. Possessive quantifiers are not implemented in .NET, but it supports atomic grouping, and a possessive quantifier is no more than an atomic grouping of a greedy quantifier. DotNETFlavor therefore changes a++ into the equivalent expression (?>a+).
Other lacking features can be emulated. JavaScript’s regex flavor does not support atomic grouping any more than possessive quantifiers. But atomic grouping may be emulated by capturing the expression in a look-ahead, then using an immediate back-reference to consume it without the possibility of backtracking. So JavaScriptFlavor mimics a++ (or (?>a+)) with (?=(a+))\1. It is a stretch, since it adds a possibly undesired capturing group, but it’s still better than no support.
Some lacking features unfortunately cannot be emulated. For instance, JavaScript does not support look-behind at all. There is no way to emulate this support, so JavaScriptFlavor will throw an IllegalArgumentException when you try to convert an expression containing a look-behind.
Some additional features are implemented at REL level and can be used. Java prior to version 7 does not support inline naming of capturing groups, as .NET does. The DotNETFlavor (as well as the Java7Flavor) inlines the group names for capture ((?<name>expr)) and reference (\k<name>).
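
The equivalence between a possessive quantifier, an atomic group and the look-ahead emulation can be observed with Java's own regex engine, which supports all three. The sketch below uses only the standard library; the names are illustrative and it does not call any REL flavor.

```scala
object AtomicEmulationDemo {
  // True when `regex` matches somewhere in `input`.
  def finds(regex: String, input: String): Boolean =
    regex.r.findFirstIn(input).isDefined

  val greedy     = "a+a"             // can backtrack and give one 'a' back
  val possessive = "a++a"            // never gives anything back
  val atomic     = "(?>a+)a"         // same behavior as the possessive form
  val emulated   = """(?=(a+))\1a""" // look-ahead capture + back-reference
}
```

On the input "aaa", only the greedy form finds a match; the other three fail in the same way because the run of a's is consumed without any possibility of backtracking. The emulation still matches when no backtracking is needed, e.g. (?=(a+))\1b on "aaab".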

Flavors expose two main methods: .express(re: RE) and .translate(re: RE). The first one returns a Tuple2[String, List[String]], whose first element is the translated regex string and whose second is a list of the group names (in order of appearance) allowing you to perform a mapping to capturing group indexes (like Scala does) if needed. The second method only performs the translation of a RE term into another.
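
As a rough illustration of how such a pair can be consumed, the sketch below maps a list of group names to capturing group indexes and looks a group up by name. It uses only the standard library. It assumes that the list holds exactly one name per capturing group, in order, and that the last occurrence wins when a name is duplicated; both are assumptions of this sketch, and the regex and names used are hypothetical.

```scala
object GroupIndexDemo {
  // Group names in order of appearance -> 1-based capturing group index.
  // With duplicated names, the last occurrence wins in this sketch.
  def indexByName(names: List[String]): Map[String, Int] =
    names.zipWithIndex.map { case (n, i) => n -> (i + 1) }.toMap

  // Content of the group called `name` in the first match, if any.
  def groupNamed(regex: String, names: List[String],
                 input: String, name: String): Option[String] =
    for {
      idx <- indexByName(names).get(name)
      m   <- regex.r.findFirstMatchIn(input)
      g   <- Option(m.group(idx)) // null (unmatched group) becomes None
    } yield g
}
```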

The following flavors are bundled with REL:

Java6, Java7
.NET
JavaScript
PCRE (C, PHP, Ruby 1.9 / Oniguruma…)
Legacy Ruby (Ruby 1.8, does not support any Unicode)

For example, to express a regex in the .NET regex flavor:

val myRegex = ^^ - (α.++ \ "firstWord")
DotNETFlavor.translate(myRegex) // approximately* ^^ - (?>(α.+) \ "firstWord")
DotNETFlavor.express(myRegex)._1 === """\A(?<firstWord>(?>[a-zA-Z]+))"""
DotNETFlavor.express(myRegex)._2.toString === "List(firstWord)"

* approximately because the named capturing group will also have an inline naming strategy (for which there is no short DSL syntax, thus skipped here for the sake of simplicity)

But Flavors are not limited to other regex implementations. You can define your own for various uses, e.g.:

maintaining an easily readable/maintainable tree in your code, injecting more capturing before runtime
debugging existing regexes without altering the original RE tree
extending pre-existing/vendor regexes
reusing the same base regex in multiple contexts requiring small changes

Cleaners

Usage

Cleaner is really just a case class around a String => String function. It is meant to help pre-process text before matching; its usage is completely optional. It also holds some utility methods to ease composition and instantiation.

You create a Cleaner simply by giving it a function:

val lineBreakNormalizer = Cleaner(_.replaceAllLiterally("\r\n", "\n"))

There is a shorthand for regex replacement:

val stripMultipleDashes = Cleaner.regexReplaceAll("--+".r, "-")

A Cleaner extends Function[String, String], so you use it like any other function, either someCleaner(someString) or someCleaner.apply(someString).

The most readable/familiar form of composing is using unix-like pipes, intuitively applied from left to right:

val myCleaner = lineBreakNormalizer | TrimFilter | LowerCaseFilter
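
The left-to-right order is the same as the one obtained by chaining plain functions with andThen. The following sketch uses only String => String functions from the standard library, as an analogy for the pipe operator rather than as REL code; all names are illustrative.

```scala
object PipelineDemo {
  val lineBreaks: String => String = _.replace("\r\n", "\n")
  val trim: String => String       = _.trim
  val lower: String => String      = _.toLowerCase

  // Applied from left to right: normalize line breaks, trim, then lowercase.
  val pipeline: String => String = lineBreaks andThen trim andThen lower
}
```

For example, PipelineDemo.pipeline("  Hello\r\nWorld ") yields "hello\nworld".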

The advantage of heavy cleaning, when you can afford it, is that you match more variations (accents, case sensitivity, double spaces…) with simpler (and possibly faster) regexes. The drawbacks are an upfront performance cost (not necessarily worse than that of a more complex/permissive regex) and, more importantly, matching on an altered text, which makes it harder to locate matches in the original text. This can be addressed by TrackString (covered later in this chapter) but at an additional performance cost.

Built-in Cleaners

In the built-in Cleaners, the naming convention follows these rules of thumb:

*Normalizer normalizes variations of the same things
*Cleaner cleans up information that is irrelevant for the task at hand
*Filter transforms the text to prepare it for matching

The bundled Cleaners are:

Name	Usage
`IdentityCleaner`	Utility no-op Cleaner
`CamelCaseSplitFilter`	Split CamelCase words; follows the form `aBc` (lower-upper-lower): will split `someWords` but not `iOS` nor `VitaminC`
`LowerCaseFilter`	Transform the text to lowercase; lowercase-only regexes will often outperform case-insensitive ones
`LineSeparatorNormalizer`	Normalize all Unicode line breaks and vertical tabs to ASCII new line `U+000A` / `\n`
`WhiteSpaceNormalizer`	Normalize all Unicode spaces and horizontal tabs to ASCII spaces `U+0020`
`WhiteSpaceCleaner`	Replace multiple instances of regular whitespaces (`\s+`) by a single space (strip line breaks)
`AllWhiteSpaceCleaner`	Replace multiple instances of all Unicode whitespaces by a single space (strip line breaks)
`SingleQuoteNormalizer`	Normalize frequent Unicode single quote / apostrophe variations (like prime or curved apostrophe) to ASCII straight apostrophe `U+0027` / `'`
`DoubleQuoteNormalizer`	Normalize frequent Unicode double quote variations to ASCII quotation mark `U+0022` / `"`
`QuoteNormalizer`	Combines `SingleQuoteNormalizer` and `DoubleQuoteNormalizer`
`DiacriticCleaner`	Pseudo ASCII folding, remove diacritical marks (like accents) and some common Unicode variants on Latin characters
`FullwidthNormalizer`	Normalize CJK Fullwidth Latin characters to their ASCII equivalents

Create a new Cleaner

You can of course create your own Cleaners.

If your cleaning operation can fit in a single regex replacement:

object HtmlTagCleaner extends Cleaner(
  Cleaner.regexReplaceFirst("<html[^>]*+>(.*)</html>", "$1"))
// or val htmlTagCleaner: Cleaner = Cleaner.regexReplaceFirst(…)

Same for multiple regex replacement:

object HtmlCommentsCleaner extends Cleaner(
  Cleaner.regexReplaceAll("<!--(.*)-->", ""))

Otherwise, you can simply instantiate a Cleaner with your own String => String transformation.

TrackStrings

TrackStrings are strings that can, to a certain extent, keep track of the shifts in positions. You pass Strings through Cleaners / regex replacement and remain able to get the position (or the best estimated range) in the original string of a [group of] character[s] that have moved in the resulting string:

import fr.splayce.rel.util.TrackString
val ts = TrackString("Test - OK - passed")
  .replaceAll(" - ".r, " ")  // "Test OK passed"
ts.srcPos(5, 7)   // Interval [7,9) was the original position of "OK"
ts.srcPos(8, 14)  // Interval [12,18) was the original position of "passed"
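
To give an intuition of what position tracking involves, here is a deliberately simplified sketch that is not the TrackString algorithm: it performs a literal replaceAll while recording, for every character of the result, the index of the source character it comes from. All names are hypothetical, the replacement is taken literally (no group references), and characters produced by a replacement are coarsely mapped to the start of the replaced match.

```scala
import scala.util.matching.Regex

object OffsetTrackingDemo {
  // Cleaned text plus, for each cleaned char, the index of its source char.
  final case class Tracked(text: String, srcIndex: Vector[Int]) {
    // Source interval [start, end) for the cleaned interval [start, end).
    // Requires start < end; only exact for characters that were kept as-is.
    def srcPos(start: Int, end: Int): (Int, Int) =
      (srcIndex(start), srcIndex(end - 1) + 1)
  }

  def replaceAll(src: String, regex: Regex, replacement: String): Tracked = {
    val text = new StringBuilder
    val idx  = Vector.newBuilder[Int]
    var last = 0
    for (m <- regex.findAllMatchIn(src)) {
      // Untouched characters map to their own index.
      for (i <- last until m.start) { text += src(i); idx += i }
      // Replacement characters all map to the start of the replaced match.
      for (c <- replacement) { text += c; idx += m.start }
      last = m.end
    }
    for (i <- last until src.length) { text += src(i); idx += i }
    Tracked(text.result(), idx.result())
  }
}
```

Applied to "Test - OK - passed" with " - " replaced by " ", this sketch gives (7,9) for the cleaned range 5 to 7 and (12,18) for the cleaned range 8 to 14, in line with the intervals shown in the example.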

And Cleaners support TrackStrings, so this allows you to:

Clean an input string
Match it against a simplified regex (thanks to the cleaning)
Know the position of the match in the original uncleaned string (e.g. for highlighting matches, performing replacements, etc.)

import fr.splayce.rel.cleaners.CamelCaseSplitFilter
val os = "MySuperClass"
val ts = CamelCaseSplitFilter(TrackString(os))  // "My Super Class"
val m = "Super".r.findFirstMatchIn(ts.toString).get
val op = ts.srcPos(m.start, m.end)  // Interval [2,7)
val highlight = os.substring(0, op.start) +
  "<strong>" + os.substring(op.start, op.end) + "</strong>" +
  os.substring(op.end)
  // "My<strong>Super</strong>Class"

Please note that, while Cleaners made with Cleaner.regexReplaceFirst and Cleaner.regexReplaceAll automatically support position tracking, you will have to implement apply: TrackString => TrackString if you implement your own cleaners, e.g. by calling TrackString.edit (see the API doc). Otherwise, the TrackString will see your string transformation as one big replacement – i.e. it will tell you that the original position of any character is somewhere between the beginning and the end of your original string, which admittedly isn’t of much help.

Limitations & Known Issues

Versioning

REL version number follows the Semantic Versioning 2.0 Specification. In the current early stage of development, the API is still unstable and backward compatibility may break. As an additional rule, in version 0.Y.Z, a Z-only version change is expected to be backward compatible with previous 0.Y.* versions. But a Y version change potentially breaks backward compatibility.

DSL

There is no representation in the DSL for specific character ranges nor raw strings.

The string primitives are not parsed (use esc(str) to escape a string that should be matched literally). Hence:

Any capturing group you pass inside those strings is not taken into account by REL when the final regex is generated. The following groups and back-references will be shifted, so the resulting regex will most probably be incorrect.
You still need to escape your expressions to match literally characters that are regex-significant like +, ? or (, even in RECst. Use esc(str) to escape the whole string.
Any regex you pass as a string is kept as-is when translated into different flavors. For instance, the \w passed in a string (as opposed to used with Word/μ) will not be translated by the DotNETFlavor.
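
The need for escaping is the same as with raw regex strings in the standard library. The sketch below uses scala.util.matching.Regex.quote as a standard-library analog of literal escaping; it does not call REL's esc, and the names are illustrative.

```scala
import scala.util.matching.Regex

object LiteralDemo {
  // Unescaped: '+' is a quantifier, so this means "one or more 1, then 1".
  val unescaped: Regex = "1+1".r
  // Quoted: the three characters are matched literally.
  val literal: Regex = Regex.quote("1+1").r

  def finds(r: Regex, input: String): Boolean = r.findFirstIn(input).isDefined
}
```

The unescaped regex matches "11" but not the text "1+1", whereas the quoted one matches "1+1" only.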

Flavors

Group names are checked: they are silently left out of the inlined output if they fail validation, or if they are duplicated when the flavor requires uniqueness.

\uXXXX is not supported by PCRE, yet not translated by PCREFlavor so far.

JavaScript regexes are quite limited and work a bit differently. In JavaScript flavor:

WordBoundary/\b is kept as-is, but will not have exactly the same semantics because of the lack of Unicode support in the JavaScript regex flavor. For instance, in "fiancé", JavaScript sees "\bfianc\bé" where most other flavors see "\bfiancé\b". Same goes for NotWordBoundary/\B.
InputBegin (^^) and InputEnd ($$) are translated to LineBegin (^) and LineEnd ($), but this is only correct if the m (multiline) flag is off.
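
The difference between input anchors and line anchors under the multiline flag can be observed in Java's regex engine, which has both. This sketch uses only the standard library, not JavaScriptFlavor; the names are illustrative.

```scala
object AnchorDemo {
  // True when `regex` matches somewhere in `input`.
  def finds(regex: String, input: String): Boolean =
    regex.r.findFirstIn(input).isDefined

  val input = "a\nb"

  val inputBegin     = """\Ab""" // 'b' at the very start of the input
  val lineBegin      = "^b"      // same meaning while multiline is off
  val multilineBegin = "(?m)^b"  // 'b' at the start of any line
}
```

On the two-line input above, neither \Ab nor ^b matches, but (?m)^b does: once the multiline flag is on, ^ is no longer a faithful substitute for \A.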

Cleaners

Not all Unicode ligatures and variations are known to DiacriticCleaner, for example:

Enclosed Alphanumeric Supplement: U+1F100-U+1F1FF (Unicode 6.1)
CJK Compatibility: U+3300-U+33FF (Unicode 6.0)
Latin Extended-D: U+A720-U+A7FF (Unicode 5.1 to 6.1)

TrackString

Regex replacements in TrackString do not support Java 7 embedded group names, which are not accessible in Scala’s Match yet. They use Scala group names instead (inconsistent with String#replaceAll).

TrackString cannot track intertwined/reordered replacements, i.e. you can only track abc => bca as a single group (as opposed to three reordered groups). If out-of-order Repl/Subst are introduced, srcPos will most probably yield incorrect results.

TODO

The following would be useful:

Core
- Add character range support (at DSL level), with inversion ([^...])
- Compatibility with Scala Parsers?
- Consider using 'symbols for group names
- Java 6/7 flavors: detect & fail on unbounded repeats in LookBehind ?
- Parse [and limit] regex strings inputted to REL, producing REL-only expression trees, thus eliminating some known issues (see above) and opening some possibilities (e.g. generating sample matching strings)
Matchers
- date: consider extracting incorrect dates (like feb. 31st) with some flag
Utils
- Generate sample strings that match a regex (e.g. with Xeger)
- Source generation or compiler plugin to enable REL independence [at runtime]
- Binary tool that would take a REL file, compile it and produce regexes in several flavors / programming languages
