REL

A Regular Expression composition Library

REL is a Scala library for people dealing with complex, modular regular expressions. It defines a DSL with most of the operators you already know and love. This allows you to isolate portions of your regex for easier testing and reuse.

Consider the following YYYY-MM-DD date regex: ^(?:19|20)\d\d([- /.])(?:0[1-9]|1[012])\1(?:0[1-9]|[12]\d|3[01])$
It is a bit more readable and reusable expressed like this:

import fr.splayce.rel._
import Implicits._

val sep     = "[- /.]" \ "sep"            // group named "sep"
val year    = ("19" | "20") ~ """\d\d"""  // ~ is concatenation
val month   = "0[1-9]" | "1[012]"
val day     = "0[1-9]" | "[12]\\d" | "3[01]"
val dateYMD = ^ ~ year  ~ sep ~ month ~ !sep ~ day  ~ $
val dateMDY = ^ ~ month ~ sep ~ day   ~ !sep ~ year ~ $

These values are RE objects (also named terms or trees/subtrees), which can be converted to scala.util.matching.Regex instances either implicitly (by importing rel.Implicits._) or explicitly (via the .r method).

The embedded Date regexes and extractors will give you more complete examples, matching several date formats at once with little prior knowledge.
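
For comparison, here is a minimal sketch of the same YYYY-MM-DD regex assembled from plain strings, using only the Scala standard library (no REL). The names PlainDateRegex and isDate are made up for this illustration. Grouping and the \1 back-reference are written by hand here, which is the bookkeeping the DSL above is meant to take care of.

```scala
import scala.util.matching.Regex

object PlainDateRegex {
  // Same fragments as above, as plain strings (no REL involved)
  val sep   = "([- /.])"                  // capturing group #1
  val year  = """(?:19|20)\d\d"""
  val month = "(?:0[1-9]|1[012])"
  val day   = """(?:0[1-9]|[12]\d|3[01])"""

  // \1 refers back to the separator, so both separators must be identical
  val dateYMD: Regex = ("^" + year + sep + month + """\1""" + day + "$").r

  def isDate(s: String): Boolean = dateYMD.findFirstIn(s).isDefined
}
```

With this sketch, "2012-01-23" is accepted, while "2012-01/23" (mixed separators) and "2012-13-01" (invalid month) are rejected.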

Features

Usage and downloads

License

Copyright © 2012 Imaginatio SAS

REL is released under the Creative Commons BY-NC-SA 4.0 International License.

Authors

REL was developed at Imaginatio for project Splayce by:

Contributors:

DSL Syntax

Some examples are noted DSL expression → resulting regex.
All assume:

import fr.splayce.rel._
import Implicits._
val a = RE("aa")
val b = RE("bb")

Operators

Binary operators

Operation                    REL Syntax  RE Output
Alternative                  a | b       aa|bb
Concatenation (protected)    a ~ b       (?:aa)(?:bb)
Concatenation (unprotected)  a - b       aabb

Generally speaking, you should start with protected concatenation. It is harder to read once serialized, but it is far safer from unwanted side-effects when reusing regex parts.
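
The kind of side effect in question can be reproduced with plain regex strings. The sketch below uses only the standard library; the fragment values and the object name ConcatProtection are chosen for illustration. Without protection, an alternative in the left fragment swallows the right fragment.

```scala
object ConcatProtection {
  val left  = "aa|bb"
  val right = "cc"

  // Unprotected: the alternative now spans the whole expression -> aa|bbcc
  val unprotected: String = left + right

  // Protected: each part is wrapped in a non-capturing group -> (?:aa|bb)(?:cc)
  val protectedRe: String = "(?:" + left + ")(?:" + right + ")"
}
```

Here "aa" alone matches the unprotected form but "aacc" does not, which is rarely what was intended; the protected form behaves the other way round.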

Quantifiers / repeaters

Where the table below uses the dotted syntax (e.g. a.?), that form is recommended because it makes operator precedence explicit.

Quantifier  Greedy                           Reluctant / Lazy    Possessive           Output (greedy)
Option      a.?                              a.??                a.?+                 (?:aa)?
≥ 1         a.+                              a.+?                a.++                 (?:aa)+
≥ 0         a.*                              a.*?                a.*+                 (?:aa)*
At most     a < 3                            a.<?(3) *           a <+ 3               (?:aa){0,3}
At least    a > 3                            a >? 3              a >+ 3               (?:aa){3,}
In range    a(1, 3), a{1 to 3} or a{1 -> 3}  a(1, 3, Reluctant)  a(1, 3, Possessive)  (?:aa){1,3}
Exactly     a{3} or a(3)                     N/A                 N/A                  (?:aa){3}

* For reluctant at-most repeater, dotted form a.<?(3) is mandatory, standalone <? being syntactically significant in Scala (XMLSTART).
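
As a reminder of what the three quantifier modes do, here is a small sketch using plain regex strings and the standard library only (the object name QuantifierModes is arbitrary). It runs the serialized forms from the table against the same input.

```scala
object QuantifierModes {
  val input = "aaaaaa"

  // Greedy: takes as many repetitions as possible
  val greedy: Option[String] = "(?:aa)+".r.findFirstIn(input)

  // Reluctant: takes as few repetitions as possible
  val reluctant: Option[String] = "(?:aa)+?".r.findFirstIn(input)

  // Greedy gives back one repetition so that the trailing "aa" can match
  val greedyThenAa: Option[String] = "(?:aa)+aa".r.findFirstIn(input)

  // Possessive never gives anything back, so the trailing "aa" cannot match
  val possessiveThenAa: Option[String] = "(?:aa)++aa".r.findFirstIn(input)
}
```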

Look-around

Type                  Prefixed form  Dotted form  Output
Look-ahead            ?=(a)          a.?=         (?=aa)
Look-behind           ?<=(a)         a.?<=        (?<=aa)
Negative look-ahead   ?!(a)          a.?!         (?!aa)
Negative look-behind  ?<!(a)         a.?<!        (?<!aa)
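
To illustrate the serialized forms in the Output column, the following sketch applies them with the standard library only (no REL; the object name LookAround is arbitrary). Each regex matches bb depending on what surrounds it, without consuming the surrounding aa.

```scala
object LookAround {
  val ahead     = "bb(?=aa)".r   // bb only if followed by aa
  val behind    = "(?<=aa)bb".r  // bb only if preceded by aa
  val notAhead  = "bb(?!aa)".r   // bb only if NOT followed by aa
  val notBehind = "(?<!aa)bb".r  // bb only if NOT preceded by aa
}
```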

Grouping

Type                       REL Syntax                   Output
Named capturing            a \ "group_a"                (aa)
Unnamed capturing *        a.g                          (aa)
Back-reference             !g                           \1 **
Non-capturing              a.ncg or a.%                 (?:aa)
Non-capturing, with flags  a.ncg("i-d") or "i-d" ?: a   (?i-d:aa)
Atomic                     a.ag, ?>(a) or a.?>          (?>aa)

* A unique group name is generated internally.
** Back-reference on the most recent (i.e. rightmost previous) group g. For example: val g = (a|b).g; g - a - !g → (aa|bb)aa\1

In a named capturing group, the name group_a will be passed to the Regex constructor, and queryable on corresponding Matches. If you export the regex to a flavor that supports inline embedding of capturing group names (like Java 7 or .NET), the name will be included in the output: (?<group_a>aa).

In non-capturing groups, REL tries not to uselessly wrap non-breaking entities, such as single characters (a, \u00F0), character classes (\w, [^a-z], \p{Lu}) or other groups, in order to produce ever-so-slightly less unreadable output. Non-capturing groups with flags are combined when nested, giving priority to innermost flags: a.ncg("-d").ncg("id") → (?i-d:aa).
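
The back-reference footnote above gives (aa|bb)aa\1 as serialized output. The sketch below checks how that plain regex behaves, using only the standard library (the object name BackReference is arbitrary): \1 must repeat exactly what the first group captured.

```scala
object BackReference {
  // Group 1 captures aa or bb; \1 requires the very same text again
  val pattern: String = """(aa|bb)aa\1"""

  def matches(s: String): Boolean = s.matches(pattern)
}
```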

Constants

A few “constants” (expression terms with no repetitions, capturing groups, or unprotected alternatives) are also predefined. Some of them have a UTF-8 Greek symbol alias for conciseness (import rel.Symbols._ to use them), uppercase for negation. You can add your own by instantiating case class RECst(expr).

Object name Symbol Output / Matches
Epsilon ε Empty string
Dot τ .
MLDot ττ [\s\S] (will match any char, including line terminators, even when the DOTALL or MULTILINE modes are disabled)
LineTerminator Τ* (?:\r\n?|[\u000A-\u000C\u0085\u2028\u2029]) (line terminators, PCRE/Perl’s \R)
AlphaLower none [a-z]
AlphaUpper none [A-Z]
Alpha α [a-zA-Z]
NotAlpha Α* [^a-zA-Z]
Letter λ \p{L} (unicode letters, including diacritics)
NotLetter Λ \P{L}
LetterLower none \p{Ll}
LetterUpper none \p{Lu}
Digit δ \d
NotDigit Δ \D
WhiteSpace σ \s
NotWhiteSpace Σ \S
Word μ \w (Alpha, Digit or _)
NotWord Μ* \W
WordBoundary ß \b
NotWordBoundary Β* \B
LineBegin ^ ^
LineEnd $ $
InputBegin ^^ \A
InputEnd $$ \z

* Those are uppercase α/ß/μ/τ, not latin A/B/M/T
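
As a quick illustration of why MLDot exists alongside Dot, the sketch below compares the two serialized forms on an input containing a line break. It uses plain regex strings and the standard library only; the object name DotVersusMLDot is arbitrary.

```scala
object DotVersusMLDot {
  val input = "a\nb"

  // Without the DOTALL flag, `.` does not match the line terminator
  val dot: Option[String] = "a.b".r.findFirstIn(input)

  // [\s\S] matches any character, line terminators included
  val mlDot: Option[String] = """a[\s\S]b""".r.findFirstIn(input)
}
```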

Extractors

Extractors are meant to help you extract information from text, using the regexes you made, especially by using the text matched in the capturing groups of your regex.

Reminder on capturing groups

Capturing groups will yield String objects that may be empty (if the group was matched to an empty part) or null (if the group wasn’t matched). For instance, matching the string "A" against the regex "(A)(B?)(C)?" will yield values "A", "", null.
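
This behaviour can be checked with the standard library alone. The helper below is a hypothetical convenience (neither GroupValues nor groups is part of REL); it wraps each group in Option so that an unmatched group (null) becomes None while an empty match stays Some("").

```scala
object GroupValues {
  // Groups 1..n of the first match, or None if the regex does not match at all.
  // Option(...) turns a null (unmatched group) into None.
  def groups(regex: String, input: String): Option[List[Option[String]]] =
    regex.r.findFirstMatchIn(input).map { m =>
      (1 to m.groupCount).map(i => Option(m.group(i))).toList
    }
}
```

For example, GroupValues.groups("(A)(B?)(C)?", "A") gives Some(List(Some("A"), Some(""), None)).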

For example, say we want to extract information from the captured matches of the following regex:

val abc = ("." \ "a") - (".".? \ "b") - ("." \ "c").?

An empty string won’t match and strings longer than 3 characters will match multiple times. Thus, for each match, the possible results are:

String a (#1) b (#2) c (#3)
"A" "A" "" null
"AB" "A" "B" null
"ABC" "A" "B" "C"

Also, Java < 7 does not support named capturing groups, and Scala only emulates them, mapping a list of names given at the compilation of the Regex against the indexes of the capturing groups. Thus, it is risky to have multiple instances of the same group name. In practice, using myMatch.group("myGroup") seems to always refer to the last occurrence of myGroup.

On the other hand, the Match object carries the full list of group names (in its eponymous groupNames val), and REL uses it to compute the group tree. Thus, you can reuse the same group name in a single expression.

The Extractor trait

The Extractor trait is mainly a function that takes in a String and gives an Iterator of the parametrized type, with utility methods for composing and pattern matching.

This Extractor trait works with a sub-extractor, which can be of two types:

RE expressions offer a utility << method to which you can pass sub-extractors of either type, getting an Extractor[A] that you can apply to Strings to perform extraction.
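
To make the "String in, Iterator out" idea concrete, here is a minimal sketch built on the standard library only. SimpleExtractor is a hypothetical stand-in and not REL's Extractor trait; it assumes a sub-extractor of type Match => Option[A] and leaves out the composition and pattern-matching utilities.

```scala
import scala.util.matching.Regex
import scala.util.matching.Regex.Match

// Hypothetical stand-in for illustration; REL's own Extractor trait is richer.
final class SimpleExtractor[A](regex: Regex, sub: Match => Option[A])
    extends (String => Iterator[A]) {
  // Apply the sub-extractor to every match and keep only the successes
  def apply(in: String): Iterator[A] =
    regex.findAllMatchIn(in).flatMap(m => sub(m).iterator)
}

object SimpleExtractorDemo {
  // Extracts (user, machine) pairs from user@machine tokens
  val userAtMachine = new SimpleExtractor[(String, String)](
    "([a-zA-Z]+)@([a-zA-Z]+)".r,
    m => Some((m.group(1), m.group(2)))
  )
}
```

Calling SimpleExtractorDemo.userAtMachine("me@dev, you@dev").toList yields List((me,dev), (you,dev)).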

Basic extractors

Some trivial sub-extractors are provided for convenience:

Example:

val extractABC = abc << MatchedExtractor()
extractABC("1234567890").toList === List("123", "456", "789", "0")
val extractB = abc << NthGroupExtractor(2)
extractB("1234567890").toList === List("2", "5", "8", "")
val extractC = abc << NamedGroupExtractor("c")
extractC("1234567890").toList === List("3", "6", "9", null)

MatchGroups for quick, flat Pattern Matching

If you want to do pattern matching on the list of strings matched by the capturing groups, you can use:

Extractor examples:

import fr.splayce.rel.util.MatchGroups._
val pf: MatchExtractor[String] = {
  case NamedGroups("A", "", null)                 => "'A' only"
  case NotNull(Some("1"), Some(""), None)         => "'1' only"
  case NotNull.NamedMap(m) if (m contains "d")    => "unreachable"
  case NotNull.NamedPairs(_, ("b", Some("B")), _) => "b has 'B'"
  case NotEmpty(Some("x"), None, None)            => "'x' only"
  case NotEmpty.NamedMap(m) if (m contains "d")   => "unreachable"
  case NotEmpty.NamedPairs(_, ("b", Some(b)), _)  => "b has: " + b
}
val extract = re << pf
// extract(someString)

Reusable extractors: MatchGroup hierarchies

One of the incentives for using REL is to reuse regex parts in other regexes. So we also need a way to reuse the corresponding extractors, including nesting them in other extractors.

Since a REL term is a tree, it can compute the resulting capturing groups tree with the matchGroup val, containing a tree of MatchGroups. The top group corresponds to the entire match: it is unnamed, contains the matched content and has the first-level capturing groups nested as subgroups. When applied to a Match, it returns a copy of the capturing groups tree with the content filled for each group that matched. Thus, you can use pattern matching with nested groups to extract any group at several levels of nesting with little code.

For example, let’s say we want to match simple usernames that have the form user@machine where both parts have only alphabetic characters. We can define the regex:

val user     = α.+ \ "user"
val at       = "@"
val machine  = α.+ \ "machine"
val username = (user - at - machine) \ "username"

And make a simple extractor that yields a tuple of Strings:

val userMatcher: PartialFunction[MatchGroup, (String, String)] = {
  case MatchGroup(None, Some(_), List(              // $0
      MatchGroup(Some("username"), Some(_), List(   // $1
        MatchGroup(Some("user"),    Some(u), Nil),  // $2
        MatchGroup(Some("machine"), Some(m), Nil)   // $3
      ))
    )) => (u, m)
}

Extraction in a String can be done like this:

import ByOptionExtractor._   // lift (and toPM later)
val userExtractor = username << lift(userMatcher)
val users = userExtractor("me@dev, you@dev")  // Iterator
users.toList.toString === "List((me,dev), (you,dev))"

BTW, you don’t need lift if you use a Function[MatchGroup, Option[A]] instead of a PartialFunction[MatchGroup, A].

Since REL supports multiple capturing groups with the same name, we can extract items formatted with username->username:

val interaction = username - "->" - username
val iaMatcher: PartialFunction[MatchGroup, (String, String)] = {
  case MatchGroup(None, Some(_), List(
      MatchGroup(Some("username"), Some(un1), _),
      MatchGroup(Some("username"), Some(un2), _)
    )) => (un1, un2)
}
val iaExtractor = interaction << lift(iaMatcher)
val interactions =
  iaExtractor("me@dev->you@dev, you@dev->me@dev")
interactions.toList.toString ===
  "List((me@dev,you@dev), (you@dev,me@dev))"

And then you make a reusable extractor, which can directly provide the extracted object. Just place the extractor one level deeper to avoid the $0 group:

val userMatcher2: PartialFunction[MatchGroup, (String, String)] = {
  case MatchGroup(Some("username"), Some(_), List(
      MatchGroup(Some("user"),    Some(u), Nil),
      MatchGroup(Some("machine"), Some(m), Nil)
    )) => (u, m)
}
val userPattern = toPM(lift(userMatcher2))
val iaMatcher2: PartialFunction[MatchGroup,
       (String, String, String, String)] = {
  case MatchGroup(None, Some(_), List(
      userPattern(u1, m1),
      userPattern(u2, m2)
    )) => (u1, m1, u2, m2)
}
val iaExtractor2 = interaction << lift(iaMatcher2)
val interactions2 =
  iaExtractor2("me@dev->you@dev, you@dev->me@dev")
interactions2.toList.toString ===
  "List((me,dev,you,dev), (you,dev,me,dev))"

In the same way, there are date extractors bundled in REL that can extract dates from strings, each match giving a list of possible date interpretations (to account for ambiguity). See the doc on Matchers for more details.

The following example demonstrates the use of pattern matching directly on a String:

val nfDateX = fr.splayce.rel.matchers.DateExtractor.NUMERIC_FULL
"From 21/10/2000 to 21/11/2000" match {
  case nfDateX(List(a), List(b)) => (a.m, b.m) === (10, 11)
}

Debugging

Finally, the toString representation of a MatchGroup can be really helpful when debugging an Extractor or a regex.

scala> val nfd = fr.splayce.rel.matchers.Date.NUMERIC_FULL
scala> nfd.matchGroup
res1: fr.splayce.rel.util.MatchGroup = 
None	None
	Some(n_f)	None
		Some(n_ymd)	None
			Some(n_sep)	None
			Some(n_sep)	None
		Some(n_dmy)	None
			Some(n_sep)	None
			Some(n_sep)	None

The top group has no name (first column is None), for it represents the whole match. We can see the sub-hierarchy of named groups, but it has no content yet. To fill the content, it must be applied to a Match:

nfd.matchGroup(nfd.r.findFirstMatchIn("1998-10-20").get)
res2: fr.splayce.rel.util.MatchGroup = 
None	Some(1998-10-20)
	Some(n_f)	Some(1998-10-20)
		Some(n_ymd)	Some(1998-10-20)
			Some(n_sep)	Some(-)
			Some(n_sep)	None
		Some(n_dmy)	None
			Some(n_sep)	None
			Some(n_sep)	None

Then we can see which groups matched which part.

Matchers

REL comes with some matchers built-in for commonly needed entities like dates. Matchers are RE expressions and corresponding extractors.

As the matchers collection will probably grow, they may come in a separate packaging one day (e.g. rel-contrib), but should keep their package and class names.

Dates

The dates matchers provide regexes and extractors for dates, both numeric (1/23/12, 2012-01-23) and alphanumeric (january 23rd, 2012), for English and French, partial (with at least a month or a year) or full.

Regexes

Regex in rel.matchers.… Matches
Date.FULL YMD and DMY numerical formats
Date.FULL_US MDY, YMD and DMY numerical formats
Date.NUMERIC Numerical, including partial dates
Date.NUMERIC_US Numerical, including MDY and partials
en.Date.ALPHA English alphanumerical dates or partials
en.Date.ALPHA_FULL English alphanumerical dates
en.Date.ALL English alphanumerical or numerical dates or partials
en.Date.ALL_FULL English alphanumerical or numerical dates
fr.Date.ALPHA French alphanumerical dates or partials
fr.Date.ALPHA_FULL French alphanumerical dates
fr.Date.ALL French alphanumerical or numerical dates or partials
fr.Date.ALL_FULL French alphanumerical or numerical dates

Extractors

Numerical dates may be ambiguous. For this reason, the date extractors will extract, for each match in the input string, a List[DateExtractor.Result]. Any additional disambiguation is left to client code.

DateExtractor.Result is a case class with three Option[Int] fields: y, m, d. A year may have 2 or 4 digits; its interpretation is also left to client code. The toString method provides search-engine-friendly tokens: 1998-10-20 will yield Y1998 Y98 M10 D20 (note that a 4-digit year is also emitted in its 2-digit form).
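
The token format can be sketched as follows. DateTokens is a hypothetical stand-in written for this illustration, not REL's DateExtractor.Result. It assumes that the 2-digit year token is the year modulo 100 and that month and day are zero-padded to two digits.

```scala
// Hypothetical stand-in for illustration only (not REL's DateExtractor.Result)
final case class DateTokens(y: Option[Int], m: Option[Int], d: Option[Int]) {
  def tokens: String = {
    // Assumption: a year of 3+ digits is emitted twice, in full and modulo 100
    val year = y.toList.flatMap { yy =>
      if (yy >= 100) List(f"Y$yy%04d", f"Y${yy % 100}%02d")
      else List(f"Y$yy%02d")
    }
    // Assumption: month and day are zero-padded to two digits
    val month = m.map(mm => f"M$mm%02d").toList
    val day   = d.map(dd => f"D$dd%02d").toList
    (year ++ month ++ day).mkString(" ")
  }
}
```

Under these assumptions, DateTokens(Some(1998), Some(10), Some(20)).tokens gives "Y1998 Y98 M10 D20", and a partial date such as DateTokens(None, Some(1), None) gives "M01".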

The extractors to use are:

To extract dates from a String:

import fr.splayce.rel.matchers
import matchers.DateExtractor.NUMERIC
import matchers.fr.{FullDateExtractor => FrDateExtr}

NUMERIC("2012-12-31").next // List(Y2012 Y12 M12 D31)
FrDateExtr("31 janvier 2013").next // List(Y2013 Y13 M01 D31)

We can also reuse those extractors in other extractors to get Lists of DateExtractor.Result directly from a String, Regex.Match or MatchGroup. DateExtractorSpec contains several examples and use cases.

Utilities

The matchers package also provides a few utility functions to help build other matchers. The escaped and unescaped functions build a RE expression to match an escaped (resp. unescaped) sub-expression, i.e. preceded by an odd (resp. even) number of occurrences of the escape string.
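
The parity rule behind escaping can be illustrated without any regex. In the sketch below, which uses only the standard library, EscapeParity and isEscaped are hypothetical names and do not belong to REL. A character counts as escaped when it is immediately preceded by an odd number of escape characters, because each pair of escape characters cancels out.

```scala
object EscapeParity {
  // True if the character at `idx` is preceded by an odd number of `esc` characters
  def isEscaped(s: String, idx: Int, esc: Char = '\\'): Boolean = {
    val run = s.take(idx).reverse.takeWhile(_ == esc).length
    run % 2 == 1
  }
}
```

For instance, the n in a\n (one backslash) is escaped, whereas the n in a\\n (two backslashes, i.e. an escaped backslash followed by a plain n) is not.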

Tree rewriting & Flavors

This chapter shows you how to recursively rewrite a REL expression, and how to use Flavors to express your regex in flavors/languages other than Scala/Java.

Subtree rewriting

An advantage of having a manipulable expression tree, other than reusing components, is that you can transform it as you please.

REL offers a way to do such manipulation quite simply using Scala’s powerful pattern matching. By passing a Rewriter to a RE object’s map method, you can recursively rewrite this object’s subtree. A Rewriter is actually a PartialFunction[RE, RE].

For example, we have a regex matching and capturing a UUID in its canonical form (lowercase hexadecimal, 8-4-4-4-12 digits). It is then used in a more complex expression as a capturing group.

val s = RE("-")
val h = RE("[0-9a-f]")
val uuid = h{8} - s - h{4} - s - h{4} - s - h{4} - s - h{12}
val complexExpression = /* … */ a ~ (uuid \ "uuid1") ~
    b ~ (uuid \ "uuid2") ~ c /* … */

Say we want to match a complexExpression elsewhere, without capturing the uuid. We can just transform our capturing "uuid" groups into non-capturing groups:

val toOther: Rewriter = {
  case Group(_, uuid, _) => uuid.ncg
}
val other = complexExpression map toOther

Now, say we also want uppercase hexadecimal in this expression, h being used in places other than uuid as well. We can extend our Rewriter:

val H = RE("[0-9A-F]")
val toOther: Rewriter = {
  case `h` => H
  case Group(_, uuid, _) => uuid.ncg
}
val other = complexExpression map toOther
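
The mechanism can be sketched on a miniature tree, independent of REL. Everything below (Node, Atom, Concat, Capture, NonCapture, RewriteDemo) is hypothetical and exists only for this illustration. The traversal order is also an assumption: children are rewritten first, then the rebuilt node is offered to the partial function, and nodes it is not defined for are kept unchanged.

```scala
// Hypothetical miniature expression tree, only to illustrate the idea
sealed trait Node {
  def render: String

  // Rewrite children first, then offer the rebuilt node to the rewriter
  def map(rw: PartialFunction[Node, Node]): Node = {
    val rebuilt: Node = this match {
      case Concat(l, r)      => Concat(l.map(rw), r.map(rw))
      case Capture(n, inner) => Capture(n, inner.map(rw))
      case NonCapture(inner) => NonCapture(inner.map(rw))
      case atom: Atom        => atom
    }
    rw.applyOrElse(rebuilt, (n: Node) => n)
  }
}
final case class Atom(expr: String) extends Node {
  def render: String = expr
}
final case class Concat(l: Node, r: Node) extends Node {
  def render: String = l.render + r.render
}
final case class Capture(name: String, inner: Node) extends Node {
  def render: String = "(" + inner.render + ")"
}
final case class NonCapture(inner: Node) extends Node {
  def render: String = "(?:" + inner.render + ")"
}

object RewriteDemo {
  val h    = Atom("[0-9a-f]")
  val tree = Concat(Atom("id="), Capture("uuid1", Concat(h, h)))

  val toOther: PartialFunction[Node, Node] = {
    case `h`               => Atom("[0-9A-F]")   // swap the hex class
    case Capture(_, inner) => NonCapture(inner)  // drop capturing
  }
  val other: Node = tree.map(toOther)
}
```

Here RewriteDemo.tree.render is id=([0-9a-f][0-9a-f]) and RewriteDemo.other.render is id=(?:[0-9A-F][0-9A-F]).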

Flavors

Other languages and tools have other regex flavors, with (sometimes subtle) differences in implementation and additional or lacking features (with respect to Java’s regex flavor). If we want to use our regexes in other flavors, we can apply some transformation to obtain compatible regexes (up to a point, the limit being unimplemented, unreplicable features).

Flavors expose two main methods: .express(re: RE) and .translate(re: RE). The first one returns a Tuple2[String, List[String]], whose first element is the translated regex string and whose second is a list of the group names (in order of appearance) allowing you to perform a mapping to capturing group indexes (like Scala does) if needed. The second method only performs the translation of a RE term into another.

The following flavors are bundled with REL:

For example, to express a regex in the .NET regex flavor:

val myRegex = ^^ - (α.++ \ "firstWord")
DotNETFlavor.translate(myRegex) // approximately* ^^ - (?>(α.+) \ "firstWord")
DotNETFlavor.express(myRegex)._1 === """\A(?<firstWord>(?>[a-zA-Z]+))"""
DotNETFlavor.express(myRegex)._2.toString === "List(firstWord)"

* approximately because the named capturing group will also have an inline naming strategy (for which there is no short DSL syntax, thus skipped here for the sake of simplicity)
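
If you need the name-to-index mapping mentioned above, it can be derived from the list of group names. The helper below is a sketch using the standard library only; GroupIndexes and byName are hypothetical names. It assumes the list holds one entry per capturing group, in order of appearance, and lets the last occurrence win when a name is duplicated.

```scala
object GroupIndexes {
  // Capturing groups are numbered from 1, in order of appearance.
  // With duplicated names, the last occurrence wins (toMap keeps the last pair).
  def byName(names: List[String]): Map[String, Int] =
    names.zipWithIndex.map { case (name, i) => name -> (i + 1) }.toMap
}
```

For the example above, GroupIndexes.byName(List("firstWord")) gives Map(firstWord -> 1).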

But Flavors are not limited to other regex implementations. You can define your own for various uses, e.g.:

Cleaners

Usage

Cleaner is really just a case class around a String => String function. It is meant to help pre-process text before matching; its usage is completely optional. It also holds some utility methods to ease composition and instantiation.

You create a Cleaner simply by giving it a function:

val lineBreakNormalizer = Cleaner(_.replaceAllLiterally("\r\n", "\n"))

There is a shorthand for regex replacement:

val stripMultipleDashes = Cleaner.regexReplaceAll("--+".r, "-")

A Cleaner extends Function[String, String], so you use it like any other function, either someCleaner(someString) or someCleaner.apply(someString).

The most readable/familiar form of composing is using unix-like pipes, intuitively applied from left to right:

val myCleaner = lineBreakNormalizer | TrimFilter | LowerCaseFilter
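
Conceptually, such a pipe is left-to-right function composition. The sketch below shows one way this could be written with the standard library only; Clean is a hypothetical stand-in and not REL's Cleaner, and its | is implemented with andThen.

```scala
// Hypothetical stand-in for illustration; not REL's Cleaner
final case class Clean(f: String => String) extends (String => String) {
  def apply(s: String): String = f(s)

  // Pipe: run this cleaner first, then the next one
  def |(next: Clean): Clean = Clean(f andThen next.f)
}

object CleanDemo {
  val lineBreaks = Clean(_.replace("\r\n", "\n"))
  val trim       = Clean(_.trim)
  val lowerCase  = Clean(_.toLowerCase)

  val all: Clean = lineBreaks | trim | lowerCase
}
```

With these definitions, CleanDemo.all("  Hello\r\nWorld ") returns "hello\nworld".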

The advantage of heavy cleaning, when you can afford it, is that you match more variations (accents, case sensitivity, double spaces…) with simpler (and possibly faster) regexes. The drawbacks are an upfront performance cost (not necessarily worse than that of a more complex/permissive regex) and, more importantly, matching on an altered text, which makes it harder to locate matches in the original text. This can be addressed by TrackString (covered later in this chapter) but at an additional performance cost.

Built-in Cleaners

In the built-in Cleaners, the naming convention follows these rules of thumb:

The bundled Cleaners are:

Name Usage
IdentityCleaner Utility no-op Cleaner
CamelCaseSplitFilter Split CamelCase words; follows the form aBc (lower-upper-lower): will split someWords but not iOS nor VitaminC
LowerCaseFilter Transform the text to lowercase; lowercase-only regexes will often outperform case-insensitive ones
LineSeparatorNormalizer Normalize all Unicode line breaks and vertical tabs to ASCII new line U+000A / \n
WhiteSpaceNormalizer Normalize all Unicode spaces and horizontal tabs to ASCII spaces U+0020
WhiteSpaceCleaner Replace multiple instances of regular whitespaces (\s+) by a single space (strip line breaks)
AllWhiteSpaceCleaner Replace multiple instances of all Unicode whitespaces by a single space (strip line breaks)
SingleQuoteNormalizer Normalize frequent Unicode single quote / apostrophe variations (like prime or curved apostrophe) to ASCII straight apostrophe U+0027 / '
DoubleQuoteNormalizer Normalize frequent Unicode double quote variations to ASCII quotation mark U+0022 / "
QuoteNormalizer Combines SingleQuoteNormalizer and DoubleQuoteNormalizer
DiacriticCleaner Pseudo ASCII folding, remove diacritical marks (like accents) and some common Unicode variants on Latin characters
FullwidthNormalizer Normalize CJK Fullwidth Latin characters to their ASCII equivalents

Create a new Cleaner

You can of course create your own Cleaners.

TrackStrings

TrackStrings are strings that can, to a certain extent, keep track of the shifts in positions. You pass Strings through Cleaners / regex replacement and remain able to get the position (or the best estimated range) in the original string of a [group of] character[s] that have moved in the resulting string:

import fr.splayce.rel.util.TrackString
val ts = TrackString("Test - OK - passed")
  .replaceAll(" - ".r, " ")  // "Test OK passed"
ts.srcPos(5, 7)   // Interval [7,9) was the original position of "OK"
ts.srcPos(8, 14)  // Interval [12,18) was the original position of "passed"
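
One way to picture position tracking is to keep, for every character of the current text, the interval it came from in the original string. The sketch below does this for literal regex replacement, using only the standard library. Tracked is a hypothetical, simplified stand-in (not REL's TrackString): it returns plain (start, end) tuples, ignores empty matches and does not expand group references in the replacement string.

```scala
import scala.util.matching.Regex

// Hypothetical, simplified stand-in for illustration (not REL's TrackString).
// src(i) is the [start, end) interval of the original string for character i.
final case class Tracked(text: String, src: Vector[(Int, Int)]) {
  // Replace every non-empty match with the literal string `repl`
  def replaceAll(re: Regex, repl: String): Tracked = {
    val out = new StringBuilder
    val pos = Vector.newBuilder[(Int, Int)]
    var last = 0
    for (m <- re.findAllMatchIn(text) if m.end > m.start) {
      // Copy untouched characters together with their source intervals
      for (i <- last until m.start) { out += text(i); pos += src(i) }
      // Every replacement character maps back to the whole matched range
      val range = (src(m.start)._1, src(m.end - 1)._2)
      for (c <- repl) { out += c; pos += range }
      last = m.end
    }
    for (i <- last until text.length) { out += text(i); pos += src(i) }
    Tracked(out.toString, pos.result())
  }

  // Smallest source interval covering characters [start, end) of the current text
  def srcPos(start: Int, end: Int): (Int, Int) = {
    val covered = src.slice(start, end)
    (covered.map(_._1).min, covered.map(_._2).max)
  }
}

object Tracked {
  // A fresh string: every character maps to itself
  def apply(s: String): Tracked =
    Tracked(s, Vector.tabulate(s.length)(i => (i, i + 1)))
}
```

On the example above, Tracked("Test - OK - passed").replaceAll(" - ".r, " ") has the text "Test OK passed", srcPos(5, 7) gives (7, 9) and srcPos(8, 14) gives (12, 18).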

And Cleaners support TrackStrings, so this allows you to:

import fr.splayce.rel.cleaners.CamelCaseSplitFilter
val os = "MySuperClass"
val ts = CamelCaseSplitFilter(TrackString(os))  // "My Super Class"
val m = "Super".r.findFirstMatchIn(ts.toString).get
val op = ts.srcPos(m.start, m.end)  // Interval [2,7)
val highlight = os.substring(0, op.start) +
  "<strong>" + os.substring(op.start, op.end) + "</strong>" +
  os.substring(op.end)
  // "My<strong>Super</strong>Class"

Please note that, while Cleaners made with Cleaner.regexReplaceFirst and Cleaner.regexReplaceAll automatically support position tracking, you will have to implement apply: TrackString => TrackString if you implement your own cleaners, e.g. by calling TrackString.edit (see the API doc). Otherwise, the TrackString will see your string transformation as one big replacement – i.e. it will tell you that the original position of any character is somewhere between the beginning and the end of your original string, which admittedly isn’t of much help.

Limitations & Known Issues

Versioning

REL version number follows the Semantic Versioning 2.0 Specification. In the current early stage of development, the API is still unstable and backward compatibility may break. As an additional rule, in version 0.Y.Z, a Z-only version change is expected to be backward compatible with previous 0.Y.* versions. But a Y version change potentially breaks backward compatibility.

DSL

There is no representation in the DSL for specific character ranges nor raw strings.

The string primitives are not parsed (use esc(str) to escape a string that should be matched literally). Hence:

Flavors

Group names are checked; they are silently left out of the output (not inlined) if they fail validation, or if they are duplicated when the flavor requires unique names.

\uXXXX is not supported by PCRE, and PCREFlavor does not translate it so far.

JavaScript regexes are quite limited and work a bit differently. In JavaScript flavor:

Cleaners

Not all Unicode ligatures and variations are known to DiacriticCleaner, for example:

TrackString

Regex replacement in TrackString does not support Java 7 embedded group names, which are not accessible in Scala’s Match yet. It will use Scala group names instead (inconsistent with String#replaceAll).

TrackString cannot track intertwined/reordered replacements, i.e. you can only track abc => bca as a single group (as opposed to three reordered groups). If out-of-order Repl/Subst are introduced, srcPos will most probably yield incorrect results.

TODO

The following would be useful: