REL

A Regular Expression composition Library

REL, a Regular Expression composition Library

REL is a small utility Scala library for people dealing with complex, modular regular expressions. It defines a DSL with most of the operators you already know and love. This allows you to isolate portions of your regex for easier testing and reuse.

Consider the following YYYY-MM-DD date regex: ^(?:19|20)\d\d([- /.])(?:0[1-9]|1[012])\1(?:0[1-9]|[12]\d|3[01])$. It is more readable and reusable when expressed like this:

import fr.splayce.rel._
import Implicits._

val sep     = "[- /.]" \ "sep"            // group named "sep"
val year    = ("19" | "20") ~ """\d\d"""  // ~ is concatenation
val month   = "0[1-9]" | "1[012]"
val day     = "0[1-9]" | "[12]\\d" | "3[01]"
val dateYMD = ^ ~ year  ~ sep ~ month ~ !sep ~ day  ~ $
val dateMDY = ^ ~ month ~ sep ~ day   ~ !sep ~ year ~ $

These values are RE objects (also called terms or trees/subtrees), which can be converted to scala.util.matching.Regex instances either implicitly (by importing rel.Implicits._) or explicitly (via the .r method).
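Once converted, the result behaves like any other scala.util.matching.Regex. As a minimal sketch that does not depend on REL, the raw date pattern quoted above can be exercised with the standard library alone; the helper name isDateYMD is hypothetical and only serves the illustration.

```scala
import scala.util.matching.Regex

// The raw YYYY-MM-DD pattern quoted in the introduction, written as a
// plain standard-library Regex (no REL involved).
val rawDateYMD: Regex =
  """^(?:19|20)\d\d([- /.])(?:0[1-9]|1[012])\1(?:0[1-9]|[12]\d|3[01])$""".r

// Hypothetical helper: true when the whole input is accepted by the pattern.
def isDateYMD(s: String): Boolean = rawDateYMD.findFirstIn(s).isDefined

val sameSeparator  = isDateYMD("2012-05-31") // both separators agree
val mixedSeparator = isDateYMD("2012-05/31") // \1 requires the same separator twice
val badMonth       = isDateYMD("2012-13-01") // month 13 is out of range
```

The backreference \1 plays the role that !sep plays in the DSL version: it forces the second separator to equal the first.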

The embedded Date regexes and extractors will give you more complete examples, matching several date formats at once with little prior knowledge.

Supported operators

Examples are noted DSL expression → resulting regex. They assume:

import fr.splayce.rel._
import Implicits._
val a = RE("aa")
val b = RE("bb")

Constants

A few "constants" (expression terms with no repetitions, capturing groups, or unprotected alternatives) are also pre-defined. Some of them have a UTF-8 Greek symbol alias for conciseness (import rel.Symbols._ to use them), with the uppercase symbol denoting the negation. You can add your own by instantiating the case class RECst(expr).

* Those are uppercase α/ß/μ/τ, not Latin A/B/M/T

Exporting regexes (and other regex flavors)

The .r method on any RE term returns a compiled scala.util.matching.Regex. The .toString method returns the source pattern (equivalent to .r.toString, so the pattern is verified).

For other regex flavors, a translation mechanism is provided: you may instantiate a Flavor, which exposes two methods, .express(re: RE) and .translate(re: RE). The first returns a Tuple2[String, List[String]]: the translated regex string and the list of group names in order of appearance, which lets you map names to capturing group indexes (as Scala does) if needed. The second only translates a RE term into another RE term.
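The name-to-index mapping mentioned above can be sketched with the standard library only. The tuple below is a made-up value in the shape described for .express, not actual REL output, and the sketch assumes that every capturing group in the pattern is named.

```scala
// Hypothetical result in the shape described above: the translated pattern
// and the group names in order of appearance. The values are made up.
val expressed: (String, List[String]) =
  ("""(\w+)@(\w+)""", List("user", "machine"))

val (pattern, groupNames) = expressed

// Capturing group indexes start at 1 (index 0 is the whole match).
// Assumes every capturing group in the pattern is named.
val groupIndex: Map[String, Int] =
  groupNames.zipWithIndex.map { case (name, i) => name -> (i + 1) }.toMap

// Use the mapping with a plain java.util.regex matcher.
val m = java.util.regex.Pattern.compile(pattern).matcher("me@dev")
val found   = m.find()
val user    = m.group(groupIndex("user"))
val machine = m.group(groupIndex("machine"))
```

If the same name appears more than once in the list, toMap keeps only the last index, so a different structure would be needed in that case.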

An example of translation into .NET-flavored regex is provided. DotNETTranslator contains the actual translation, DotNETFlavor being declared in the flavors package object. The translation:

Another example is the JavaScriptTranslator, which will mainly throw an exception when you try to translate a RE term that is not supported in the JavaScript regex flavor.

Regular-expression.info's regex flavors comparison chart may be of use when writing a translation.

Capturing Groups

Since a REL term is a tree, it can compute the resulting capturing groups tree with the matchGroup val, containing a tree of MatchGroups. The top group corresponds to the entire match: it is unnamed, contains the matched content and has the first-level capturing groups nested as subgroups. When applied to a Match, the content of each group is filled. Thus, you can use pattern matching with nested groups to extract any group at several levels of nesting with little code.
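The idea of matching on a group tree can be sketched without REL. The Group case class below is a stand-in with the same three-field shape (name, matched content, subgroups) and is not the library's MatchGroup; the tree is built by hand rather than computed from a Match.

```scala
// Stand-in for illustration only: same three-field shape as a group tree
// node (name, matched content, subgroups), but NOT the REL class itself.
case class Group(name: Option[String], matched: Option[String], subgroups: List[Group])

// A hand-built tree, as it could look for the input "2012-05".
val tree = Group(None, Some("2012-05"), List(
  Group(Some("year"),  Some("2012"), Nil),
  Group(Some("month"), Some("05"),   Nil)
))

// One nested pattern reaches both subgroups at once.
val extracted: Option[(String, String)] = tree match {
  case Group(None, Some(_), List(
         Group(Some("year"),  Some(y), Nil),
         Group(Some("month"), Some(m), Nil)
       )) => Some((y, m))
  case _ => None
}
```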

For example, let's say we want to match simple usernames of the form user@machine, where both parts contain only alphabetic characters. We can define the regex:

val user     = α.+ \ "user"
val at       = "@"
val machine  = α.+ \ "machine"
val username = (user - at - machine) \ "username"

And make a simple extractor that yields a tuple of Strings:

val userMatcher: PartialFunction[MatchGroup, (String, String)] = {
  case MatchGroup(None, Some(_), List(              // this is the full match ($0)
      MatchGroup(Some("username"), Some(_), List(   // $1 / username
        MatchGroup(Some("user"),    Some(u), Nil),  // $2 / user
        MatchGroup(Some("machine"), Some(m), Nil)   // $3 / machine
      ))
    )) => (u, m)
}

Extraction in a String can be done like this:

import ByOptionExtractor._                    // lift (and toPM on further examples)
val userExtractor = username << lift(userMatcher)
val users = userExtractor("me@dev, you@dev")  // Iterator[(String, String)]
users.toList.toString === "List((me,dev), (you,dev))"

Before Java 7, Java did not support named capturing groups, and Scala only emulates them, mapping a list of names given when the Regex is constructed to the indexes of the capturing groups. Thus, it is risky to have multiple instances of the same group name. In practice, using myMatch.group("myGroup") seems to always refer to the last occurrence of myGroup.
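The emulation can be seen with the standard library alone. In this minimal sketch the names are passed to the Regex constructor next to the pattern; the pattern and names are made up for the illustration, and recent Scala versions may emit a deprecation warning for this style of group naming.

```scala
import scala.util.matching.Regex

// Names are passed next to the pattern; the pattern itself only contains
// plain, index-based capturing groups.
val userAtMachine = new Regex("""(\w+)@(\w+)""", "user", "machine")

val firstMatch: Option[Regex.Match] = userAtMachine.findFirstMatchIn("me@dev")

// group("user") is resolved to group(1) through the list of names.
val byName  = firstMatch.map(_.group("user"))
val byIndex = firstMatch.map(_.group(1))
```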

On the other hand, the Match object carries the full list of group names (in its eponymous groupNames val), and REL uses it to compute the group tree. Thus, you can reuse the same group name in a single expression.

Say we want to extract items formatted with username->username:

val interaction = username - "->" - username
val iaMatcher: PartialFunction[MatchGroup, (String, String)] = {
  case MatchGroup(None, Some(_), List(
      MatchGroup(Some("username"), Some(un1), _),
      MatchGroup(Some("username"), Some(un2), _)
    )) => (un1, un2)
}
val iaExtractor = interaction << lift(iaMatcher)
val interactions = iaExtractor("me@dev->you@dev, you@dev->me@dev")
interactions.toList.toString === "List((me@dev,you@dev), (you@dev,me@dev))"

You can of course reuse the same extractor, which can directly provide the extracted object. This requires us to place the extractor one level deeper to avoid the $0 group:

val userMatcher2: PartialFunction[MatchGroup, (String, String)] = {
  case MatchGroup(Some("username"), Some(_), List(
        MatchGroup(Some("user"),    Some(u), Nil),
        MatchGroup(Some("machine"), Some(m), Nil)
    )) => (u, m)
}
val userPattern = toPM(lift(userMatcher2))
val iaMatcher2: PartialFunction[MatchGroup, (String, String, String, String)] = {
  case MatchGroup(None, Some(_), List(
      userPattern(u1, m1),
      userPattern(u2, m2)
    )) => (u1, m1, u2, m2)
}
val iaExtractor2 = interaction << lift(iaMatcher2)
val interactions2 = iaExtractor2("me@dev->you@dev, you@dev->me@dev")
interactions2.toList.toString === "List((me,dev,you,dev), (you,dev,me,dev))"

TODO

Known issues

Versioning

REL version numbers follow the Semantic Versioning 2.0 Specification. In the current early stage of development, the API is still unstable and backward compatibility may break. However, within version 0.Y.Z, a Z-only version change is expected to be backward compatible with previous 0.Y.* versions, whereas a Y version change potentially breaks backward compatibility.
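As a rough sketch of this rule, the hypothetical helper below (not part of REL) tells whether an upgrade between two 0.Y.Z versions is expected to be backward compatible. It assumes plain numeric versions with exactly three components.

```scala
// Hypothetical helper, not part of REL: encodes the rule stated above for
// 0.Y.Z versions, where only a Z-only upgrade is expected to be compatible.
def expectedCompatible(from: String, to: String): Boolean = {
  def parts(v: String): List[Int] = v.split('.').toList.map(_.toInt)
  (parts(from), parts(to)) match {
    case (List(0, y1, z1), List(0, y2, z2)) => y1 == y2 && z2 >= z1
    case _ => false // outside the 0.Y.Z scheme described here
  }
}

val patchUpgrade = expectedCompatible("0.3.0", "0.3.2") // same Y, newer Z
val minorUpgrade = expectedCompatible("0.3.2", "0.4.0") // Y changed
```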

String primitives

The string primitives are not parsed (use esc(str) to escape a string that should be matched literally). Hence:
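To illustrate why an unescaped string matters, here is a minimal sketch using only the standard library. java.util.regex.Pattern.quote is used as a stand-in for literal escaping; it is not REL's esc, whose exact output is not shown here.

```scala
import java.util.regex.Pattern

// An unescaped "." is a regex wildcard, so "1.5" also matches "1x5".
val unescaped = "1.5".r
// Pattern.quote is the standard-library way to get a literal match; it is
// used here as a stand-in and is not REL's esc.
val literal = Pattern.quote("1.5").r

val looseHit   = unescaped.findFirstIn("1x5").isDefined
val literalHit = literal.findFirstIn("1x5").isDefined
val exactHit   = literal.findFirstIn("1.5").isDefined
```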

Flavors

JavaScript regexes are very limited and work a bit differently. In JavaScript flavor

In .NET flavor, the group names are not guaranteed to be valid.

Usage and downloads

License

Copyright © 2012 Imaginatio SAS

REL is released under the MIT License

Authors

REL was developed by Adrien Lavoillotte (@streetpc) and Julien Martin for project Splayce at Imaginatio.