A Regular Expression composition Library
REL is a Scala library for people dealing with complex, modular regular expressions. It defines a DSL with most of the operators you already know and love. This allows you to isolate portions of your regex for easier testing and reuse.
Consider the following YYYY-MM-DD date regex: ^(?:19|20)\d\d([- /.])(?:0[1-9]|1[012])\1(?:0[1-9]|[12]\d|3[01])$
It is a bit more readable and reusable expressed like this:
import fr.splayce.rel._
import Implicits._
val sep = "[- /.]" \ "sep" // group named "sep"
val year = ("19" | "20") ~ """\d\d""" // ~ is concatenation
val month = "0[1-9]" | "1[012]"
val day = "0[1-9]" | "[12]\\d" | "3[01]"
val dateYMD = ^ ~ year ~ sep ~ month ~ !sep ~ day ~ $
val dateMDY = ^ ~ month ~ sep ~ day ~ !sep ~ year ~ $
These values are RE objects (also named terms or trees/subtrees), which can be converted to scala.util.matching.Regex instances either implicitly (by importing rel.Implicits._) or explicitly (via the .r method).
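Since the end result is an ordinary scala.util.matching.Regex, its behaviour can be checked with the standard library alone. The following is a minimal sketch that compiles the hand-written date regex from above directly (it does not use REL itself); the names dateYMD and isDate are only illustrative.

```scala
import scala.util.matching.Regex

// The hand-written YYYY-MM-DD regex from the introduction, compiled as-is
val dateYMD: Regex =
  """^(?:19|20)\d\d([- /.])(?:0[1-9]|1[012])\1(?:0[1-9]|[12]\d|3[01])$""".r

// Whole-input check through the underlying java.util.regex.Pattern
def isDate(s: String): Boolean = dateYMD.pattern.matcher(s).matches()

val consistent   = isDate("2012-01-23") // same separator twice
val inconsistent = isDate("2012-01/23") // \1 requires the second separator to equal the first
val badMonth     = isDate("2012-13-23") // month must be 01 to 12
```

The back-reference \1 is what the !sep term expresses in the DSL version.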
The embedded Date regexes and extractors will give you more complete examples, matching several date formats at once with little prior knowledge.
Copyright © 2012 Imaginatio SAS
REL is released under the Creative Commons BY-NC-SA 4.0 International License.
REL was developed at Imaginatio for project Splayce by:
Contributors:
TrackString algorithm
Some examples are noted DSL expression → resulting regex.
All assume:
import fr.splayce.rel._
import Implicits._
val a = RE("aa")
val b = RE("bb")
Operation | REL Syntax | RE Output |
---|---|---|
Alternative | a | b | aa|bb |
Concatenation (protected) | a ~ b | (?:aa)(?:bb) |
Concatenation (unprotected) | a - b | aabb |
Generally speaking, you should start with protected concatenation. It is harder to read once serialized, but it is far safer from unwanted side-effects when reusing regex parts.
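To see why protection matters, here is a small sketch using plain string concatenation as a stand-in for the two REL operators (an assumption for illustration, not REL code). The fragment contains a top-level alternative, which is the typical case where unprotected concatenation changes the meaning.

```scala
import scala.util.matching.Regex

val left  = "x|y" // a reusable fragment with a top-level alternative
val right = "z"

val unprotectedRe: Regex = (left + right).r                        // x|yz
val protectedRe: Regex   = ("(?:" + left + ")(?:" + right + ")").r // (?:x|y)(?:z)

// Whole-input match
def fullMatch(re: Regex, s: String): Boolean = re.pattern.matcher(s).matches()

val leakX     = fullMatch(unprotectedRe, "x")  // true: the alternative swallowed the concatenation
val missXZ    = fullMatch(unprotectedRe, "xz") // false: "x or yz", not "(x or y) then z"
val protectXZ = fullMatch(protectedRe, "xz")   // true
val protectX  = fullMatch(protectedRe, "x")    // false
```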
For the quantifiers in the table below, the dotted syntax (a.?) is recommended for clearer priority.
Quantifier | Greedy | Reluctant / Lazy | Possessive | Output (greedy) |
---|---|---|---|---|
Option | a.? | a.?? | a.?+ | (?:aa)? |
≥ 1 | a.+ | a.+? | a.++ | (?:aa)+ |
≥ 0 | a.* | a.*? | a.*+ | (?:aa)* |
At most | a < 3 | a.<?(3) * | a <+ 3 | (?:aa){0,3} |
At least | a > 3 | a >? 3 | a >+ 3 | (?:aa){3,} |
In range | a(1, 3), a{1 to 3} or a{1 -> 3} | a(1, 3, Reluctant) | a(1, 3, Possessive) | (?:aa){1,3} |
Exactly | a{3} or a(3) | N/A | N/A | (?:aa){3} |
* For the reluctant at-most repeater, the dotted form a.<?(3) is mandatory, since a standalone <? is syntactically significant in Scala (XMLSTART).
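The difference between the three quantifier modes can be observed on the serialized output with the standard library. This sketch uses the "at most 3" output from the table, written by hand (it is not produced by REL here); the value names are illustrative.

```scala
// Greedy: takes as many repetitions as possible
val greedyMax = "(?:aa){0,3}".r.findFirstIn("aaaa")

// Reluctant: takes as few repetitions as possible, here none at all
val reluctantMin = "(?:aa){0,3}?".r.findFirstIn("aaaa")

// Greedy gives back one repetition so that the trailing "aa" can match
val greedyBacktracks = "(?:aa){0,3}aa".r.findFirstIn("aaaa")

// Possessive never gives anything back, so the trailing "aa" cannot match
val possessiveFails = "(?:aa){0,3}+aa".r.findFirstIn("aaaa")
```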
Look-around | Prefixed form | Dotted form | Output |
---|---|---|---|
Look-ahead | ?=(a) | a.?= | (?=aa) |
Look-behind | ?<=(a) | a.?<= | (?<=aa) |
Negative look-ahead | ?!(a) | a.?! | (?!aa) |
Negative look-behind | ?<!(a) | a.?<! | (?<!aa) |
Type | REL Syntax | Output |
---|---|---|
Named capturing | a \ "group_a" | (aa) |
Unnamed capturing * | a.g | (aa) |
Back-reference | !g | \1 ** |
Non-capturing | a.ncg or a.% | (?:aa) |
Non-capturing, with flags | a.ncg("i-d") or "i-d" ?: a | (?i-d:aa) |
Atomic | a.ag, ?>(a) or a.?> | (?>aa) |
* A unique group name is generated internally.
** Back-reference on the most recent (i.e. rightmost previous) group g. For example, val g = (a|b).g; g - a - !g → (aa|bb)aa\1
In a named capturing group, the name group_a will be passed to the Regex constructor, and will be queryable on the corresponding Matches. If you export the regex to a flavor that supports inline embedding of capturing group names (like Java 7 or .NET), the name will be included in the output: (?<group_a>aa).
In non-capturing groups, REL tries not to uselessly wrap non-breaking entities, such as single characters (a, \u00F0), character classes (\w, [^a-z], \p{Lu}) and other groups, in order to produce ever-so-slightly less unreadable output. Non-capturing groups with flags are combined when nested, giving priority to the innermost flags: a.ncg("-d").ncg("id") → (?i-d:aa).
A few “constants” (expression terms with no repetitions, capturing groups, or unprotected alternatives) are also predefined. Some of them have a UTF-8 Greek symbol alias for conciseness (import rel.Symbols._ to use them), with the uppercase symbol standing for the negation. You can add your own by instantiating the case class RECst(expr).
Object name | Symbol | Output / Matches |
---|---|---|
Epsilon | ε | Empty string |
Dot | τ | . |
MLDot | ττ | [\s\S] (matches any character, including line terminators, even when the DOTALL or MULTILINE modes are disabled) |
LineTerminator | Τ * | (?:\r\n?|[\u000A-\u000C\u0085\u2028\u2029]) (line terminators, PCRE/Perl’s \R) |
AlphaLower | none | [a-z] |
AlphaUpper | none | [A-Z] |
Alpha | α | [a-zA-Z] |
NotAlpha | Α * | [^a-zA-Z] |
Letter | λ | \p{L} (Unicode letters, including diacritics) |
NotLetter | Λ | \P{L} |
LetterLower | none | \p{Ll} |
LetterUpper | none | \p{Lu} |
Digit | δ | \d |
NotDigit | Δ | \D |
WhiteSpace | σ | \s |
NotWhiteSpace | Σ | \S |
Word | μ | \w (Alpha, Digit or _) |
NotWord | Μ * | \W |
WordBoundary | ß | \b |
NotWordBoundary | Β * | \B |
LineBegin | ^ | ^ |
LineEnd | $ | $ |
InputBegin | ^^ | \A |
InputEnd | $$ | \z |
* Those are uppercase α/ß/μ/τ, not Latin A/B/M/T.
Extractors are meant to help you extract information from text using the regexes you made, especially through the text matched by the capturing groups of your regex.
Capturing groups will yield a String object that may be empty (if the group matched an empty part) or null (if the group didn’t match). For instance, matching the string "A" against the regex "(A)(B?)(C)?" will yield the values "A", "", null.
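This empty-versus-null distinction comes from the underlying regex engine and can be reproduced with the standard library. The sketch below also shows one way to turn the raw values into Options; the value names are illustrative.

```scala
// Match "A" against the regex from the paragraph above
val firstMatch = "(A)(B?)(C)?".r.findFirstMatchIn("A").get

// Raw group values: empty string for a group that matched nothing, null for a group that did not match
val rawGroups: List[String] =
  (1 to firstMatch.groupCount).map(i => firstMatch.group(i)).toList

// Option(null) is None, so unmatched groups become None
val asOptions: List[Option[String]] = rawGroups.map(Option(_))

// Additionally treat empty matches as absent
val nonEmptyOnly: List[Option[String]] = asOptions.map(_.filter(_.nonEmpty))
```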
For example, say we want to extract information from the captured matches of the following regex:
val abc = ("." \ "a") - (".".? \ "b") - ("." \ "c").?
An empty string won’t match and strings longer than 3 characters will match multiple times. Thus, for each match, the possible results are:
String | a (#1) | b (#2) | c (#3) |
---|---|---|---|
"A" | "A" | "" | null |
"AB" | "A" | "B" | null |
"ABC" | "A" | "B" | "C" |
Also, Java < 7 does not support named capturing groups, and Scala only emulates them, mapping a list of names given at the compilation of the Regex against the indexes of the capturing groups. Thus, it is risky to have multiple instances of the same group name. In practice, myMatch.group("myGroup") seems to always refer to the last occurrence of myGroup.
On the other hand, the Match object carries the full list of group names (in its eponymous groupNames val), and REL uses it to compute the group tree. Thus, you can reuse the same group name in a single expression.
The Extractor trait is mainly a function that takes in a String and gives an Iterator of the parametrized type, with utility methods for composing and pattern matching.
This Extractor trait works with a sub-extractor, which can be of two types:
- PartialFunction[Regex.Match, A], which is pattern matching-friendly
- PartialFunction[Regex.Match, Option[A]], which can allow a bit more flexibility and/or performance
RE expressions offer a utility << method to which you can pass sub-extractors of either type, getting an Extractor[A] that you can apply to Strings to perform extraction.
Some trivial sub-extractors are provided for convenience:
- MatchedExtractor, which yields each match as a String.
- NthGroupExtractor, which yields the content matched by the nth capturing group, with n defaulting to 1.
- NamedGroupExtractor, which does the same with the group holding the specified name.
Example:
val extractABC = abc << MatchedExtractor()
extractABC("1234567890").toList === List("123", "456", "789", "0")
val extractB = abc << NthGroupExtractor(2)
extractB("1234567890").toList === List("2", "5", "8", "")
val extractC = abc << NamedGroupExtractor("c")
extractC("1234567890").toList === List("3", "6", "9", null)
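The same lists can be reproduced with the standard library alone, which is handy for checking expectations. This sketch assumes that abc serializes to (.)(.?)(.)? (an assumption based on the DSL tables above, not verified against REL) and uses group indexes instead of names.

```scala
// Assumed plain-regex equivalent of the abc term
val abcPlain = "(.)(.?)(.)?".r
val digits = "1234567890"

// Every match as a String
val allMatched = abcPlain.findAllIn(digits).toList

// Content of the 2nd and 3rd capturing groups for each match
val secondGroups = abcPlain.findAllMatchIn(digits).map(_.group(2)).toList
val thirdGroups  = abcPlain.findAllMatchIn(digits).map(_.group(3)).toList
```

The last match covers only "0", which is why the second group is empty and the third is null there.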
If you want to do pattern matching on the list of strings matched by the capturing groups, you can use:
- MatchGroups(val1, val2, …) where valn are matched Strings that may be null or empty.
- NotNull(opt1, Some(val2), …) where optn are Option[String]: Some(valn) if the nth group matched (even if empty), None otherwise.
- NotNull.NamedMap(map) where map will be a Map[String, Option[String]] with group names as keys.
- NotNull.NamedPairs(pair1, (name2, opt2), …) where each pair is a Tuple2[String, Option[String]] with the group name and optional value.
- NotEmpty, NotEmpty.NamedMap and NotEmpty.NamedPairs if you don’t care for empty matches: options will only be Some(value) if value is not an empty string.
Extractor examples:
import fr.splayce.rel.util.MatchGroups._
val pf: MatchExtractor[String] = {
  case NamedGroups("A", "", null) => "'A' only"
  case NotNull(Some("1"), Some(""), None) => "'1' only"
  case NotNull.NamedMap(m) if (m contains "d") => "unreachable"
  case NotNull.NamedPairs(_, ("b", Some("B")), _) => "b has 'B'"
  case NotEmpty(Some("x"), None, None) => "'x' only"
  case NotEmpty.NamedMap(m) if (m contains "d") => "unreachable"
  case NotEmpty.NamedPairs(_, ("b", Some(b)), _) => "b has: " + b
}
val extract = re << pf
// extract(someString)
One of the incentives for using REL is to reuse regex parts in other regexes. So we also need a way to reuse the corresponding extractors, including nesting them in other extractors.
Since a REL term is a tree, it can compute the resulting capturing group tree with the matchGroup val, containing a tree of MatchGroups. The top group corresponds to the entire match: it is unnamed, contains the matched content and has the first-level capturing groups nested as subgroups. When applied to a Match, it returns a copy of the capturing group tree with the content filled in for each group that matched. Thus, you can use pattern matching with nested groups to extract any group at several levels of nesting with little code.
For example, let’s say we want to match simple usernames of the form user@machine, where both parts contain only alphabetic characters. We can define the regex:
val user = α.+ \ "user"
val at = "@"
val machine = α.+ \ "machine"
val username = (user - at - machine) \ "username"
And make a simple extractor that yields a tuple of Strings:
val userMatcher: PartialFunction[MatchGroup, (String, String)] = {
  case MatchGroup(None, Some(_), List( // $0
    MatchGroup(Some("username"), Some(_), List( // $1
      MatchGroup(Some("user"), Some(u), Nil), // $2
      MatchGroup(Some("machine"), Some(m), Nil) // $3
    ))
  )) => (u, m)
}
Extraction in a String can be done like this:
import ByOptionExtractor._ // lift (and toPM later)
val userExtractor = username << lift(userMatcher)
val users = userExtractor("me@dev, you@dev") // Iterator
users.toList.toString === "List((me,dev), (you,dev))"
Note that you don’t need lift if you use a Function[MatchGroup, Option[A]] instead of a PartialFunction[MatchGroup, A].
Since REL supports multiple capturing groups with the same name, we can extract items formatted as username->username:
val interaction = username - "->" - username
val iaMatcher: PartialFunction[MatchGroup, (String, String)] = {
  case MatchGroup(None, Some(_), List(
    MatchGroup(Some("username"), Some(un1), _),
    MatchGroup(Some("username"), Some(un2), _)
  )) => (un1, un2)
}
val iaExtractor = interaction << lift(iaMatcher)
val interactions =
  iaExtractor("me@dev->you@dev, you@dev->me@dev")
interactions.toList.toString ===
  "List((me@dev,you@dev), (you@dev,me@dev))"
You can then make a reusable extractor that directly provides the extracted object. Just place the extractor one level deeper to avoid the $0 group:
val userMatcher2: PartialFunction[MatchGroup, (String, String)] = {
  case MatchGroup(Some("username"), Some(_), List(
    MatchGroup(Some("user"), Some(u), Nil),
    MatchGroup(Some("machine"), Some(m), Nil)
  )) => (u, m)
}
val userPattern = toPM(lift(userMatcher2))
val iaMatcher2: PartialFunction[MatchGroup,
    (String, String, String, String)] = {
  case MatchGroup(None, Some(_), List(
    userPattern(u1, m1),
    userPattern(u2, m2)
  )) => (u1, m1, u2, m2)
}
val iaExtractor2 = interaction << lift(iaMatcher2)
val interactions2 =
  iaExtractor2("me@dev->you@dev, you@dev->me@dev")
interactions2.toList.toString ===
  "List((me,dev,you,dev), (you,dev,me,dev))"
In the same way, REL bundles date extractors that can extract dates from strings, each match giving a list of possible date interpretations (to account for ambiguity). See the doc on Matchers for more details.
The following example demonstrates the use of pattern matching directly on a String:
val nfDateX = fr.splayce.rel.matchers.DateExtractor.NUMERIC_FULL
"From 21/10/2000 to 21/11/2000" match {
  case nfDateX(List(a), List(b)) => (a.m, b.m) === (10, 11)
}
Finally, the toString representation of a MatchGroup can be really helpful when debugging an Extractor or a regex.
scala> val nfd = fr.splayce.rel.matchers.Date.NUMERIC_FULL
scala> nfd.matchGroup
res1: fr.splayce.rel.util.MatchGroup =
None None
  Some(n_f) None
    Some(n_ymd) None
      Some(n_sep) None
      Some(n_sep) None
    Some(n_dmy) None
      Some(n_sep) None
      Some(n_sep) None
The top group has no name (first column is None), for it represents the whole match. We can see the sub-hierarchy of named groups, but it has no content yet. To fill the content, it must be applied to a Match:
nfd.matchGroup(nfd.r.findFirstMatchIn("1998-10-20").get)
res2: fr.splayce.rel.util.MatchGroup =
None Some(1998-10-20)
  Some(n_f) Some(1998-10-20)
    Some(n_ymd) Some(1998-10-20)
      Some(n_sep) Some(-)
      Some(n_sep) None
    Some(n_dmy) None
      Some(n_sep) None
      Some(n_sep) None
Then we can see which groups matched which part.
REL comes with some matchers built-in for commonly needed entities like dates. Matchers are RE expressions and corresponding extractors.
As the matchers collection will probably grow, they may come in a separate packaging one day (e.g. rel-contrib), but should keep their package and class names.
The date matchers provide regexes and extractors for dates, both numeric (1/23/12, 2012-01-23) and alphanumeric (january 23rd, 2012), for English and French, partial (with at least a month or a year) or full.
Regex in rel.matchers.… | Matches |
---|---|
Date.FULL | YMD and DMY numerical formats |
Date.FULL_US | MDY, YMD and DMY numerical formats |
Date.NUMERIC | Numerical, including partial dates |
Date.NUMERIC_US | Numerical, including MDY and partials |
en.Date.ALPHA | English alphanumerical dates or partials |
en.Date.ALPHA_FULL | English alphanumerical dates |
en.Date.ALL | English alphanumerical or numerical dates or partials |
en.Date.ALL_FULL | English alphanumerical or numerical dates |
fr.Date.ALPHA | French alphanumerical dates or partials |
fr.Date.ALPHA_FULL | French alphanumerical dates |
fr.Date.ALL | French alphanumerical or numerical dates or partials |
fr.Date.ALL_FULL | French alphanumerical or numerical dates |
Numerical dates may be ambiguous. For this reason, the date extractors will extract, for each match in the input string, a List[DateExtractor.Result]. Any additional disambiguation is left to client code.
DateExtractor.Result is a case class with three Option[Int]: y, m, d. A year may have 2 or 4 digits, which is left for interpretation too. The toString method provides search-engine-friendly tokens: 1998-10-20 will yield Y1998 Y98 M10 D20 (note that the year is also repeated in its 2-digit form).
The extractors to use are:
- matchers.DateExtractor for numeric (full or partial, English or French)
- matchers.en.FullDateExtractor for matchers.en.*_FULL
- matchers.en.DateExtractor for matchers.en.*
- matchers.fr.FullDateExtractor for matchers.fr.*_FULL
- matchers.fr.DateExtractor for matchers.fr.*
To extract dates from a String:
import fr.splayce.rel.matchers
import matchers.DateExtractor.NUMERIC
import matchers.fr.{FullDateExtractor => FrDateExtr}
NUMERIC("2012-12-31").next // List(Y2012 Y12 M12 D31)
FrDateExtr("31 janvier 2013").next // List(Y2013 Y13 M01 D31)
We can also reuse those extractors in other extractors to get Lists of DateExtractor.Result directly from a String, Regex.Match, MatchGroup… DateExtractorSpec contains several examples and use cases.
The matchers package also provides a few utility functions to help build other matchers. The escaped and unescaped functions build a RE expression to match an escaped (resp. unescaped) sub-expression, i.e. one preceded by an odd (resp. even) number of occurrences of the escape string.
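The parity rule can be illustrated with a hand-written regex and the standard library: a character preceded by zero or an even number of escape strings is unescaped, and one preceded by an odd number is escaped. This is only a sketch of the idea with a backslash as escape string and a double quote as sub-expression; it is not the output of REL's escaped/unescaped functions, and all names are illustrative.

```scala
import scala.util.matching.Regex

val bs = """\\""" // regex matching one literal backslash

// Not preceded by a backslash, then any number of backslash pairs, then the quote
val unescapedQuote: Regex = (s"(?<!$bs)(?:$bs$bs)*" + "\"").r
// Same, with one extra backslash right before the quote
val escapedQuote: Regex = (s"(?<!$bs)(?:$bs$bs)*$bs" + "\"").r

// Index of the quote character in each match
def quotePositions(re: Regex, s: String): List[Int] =
  re.findAllMatchIn(s).map(_.end - 1).toList

// Characters: a \ " b " c \ \ "   (quotes at indexes 2, 4 and 8)
val sample = "a\\\"b\"c\\\\\""

val unescapedAt = quotePositions(unescapedQuote, sample) // zero and two backslashes before
val escapedAt   = quotePositions(escapedQuote, sample)   // one backslash before
```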
This chapter shows you how to recursively rewrite a REL expression, and how to use Flavors to express your regex in flavors/languages other than Scala/Java.
An advantage to having a manipulable expression tree, other than reusing components, is that you can transform them as you please.
REL offers a way to do such manipulation quite simply using Scala’s powerful pattern matching. By passing a Rewriter to a RE object’s map method, you can recursively rewrite this object’s subtree. A Rewriter is actually a PartialFunction[RE, RE].
For example, we have a regex matching and capturing a UUID in its canonical form (lowercase hexadecimal, 8-4-4-4-12 digits). It is then used in a more complex expression as a capturing group.
val s = RE("-")
val h = RE("[0-9a-f]")
val uuid = h{8} - s - h{4} - s - h{4} - s - h{4} - s - h{12}
val complexExpression = /* … */ a ~ (uuid \ "uuid1") ~
b ~ (uuid \ "uuid2") ~ c /* … */
Say we want to match a complexExpression elsewhere, without capturing the UUIDs. We can simply transform our capturing "uuid" groups into non-capturing groups:
val toOther: Rewriter = {
  case Group(_, uuid, _) => uuid.ncg
}
val other = complexExpression map toOther
Now, say we also want uppercase hexadecimal in this expression, h also being used in places other than uuid. We can extend our Rewriter:
val H = RE("[0-9A-F]")
val toOther: Rewriter = {
  case `h` => H
  case Group(_, uuid, _) => uuid.ncg
}
val other = complexExpression map toOther
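The mechanism behind map can be sketched outside REL with a toy expression tree. Everything below (Expr, Atom, Concat, Capture, NonCapture, render) is hypothetical and much simpler than REL's actual classes, and the bottom-up traversal order is an assumption; the point is only to show how a PartialFunction can drive a recursive rewrite.

```scala
// Toy expression tree, NOT REL's RE hierarchy
sealed trait Expr {
  // Rewrite children first (bottom-up), then the node itself if the rewriter applies
  def map(rw: PartialFunction[Expr, Expr]): Expr = {
    val rebuilt: Expr = this match {
      case Concat(l, r)     => Concat(l.map(rw), r.map(rw))
      case Capture(n, e)    => Capture(n, e.map(rw))
      case NonCapture(e)    => NonCapture(e.map(rw))
      case atom: Atom       => atom
    }
    if (rw.isDefinedAt(rebuilt)) rw(rebuilt) else rebuilt
  }

  // Serialize the tree to a regex string
  def render: String = this match {
    case Atom(s)       => s
    case Concat(l, r)  => l.render + r.render
    case Capture(_, e) => "(" + e.render + ")"
    case NonCapture(e) => "(?:" + e.render + ")"
  }
}
final case class Atom(regex: String) extends Expr
final case class Concat(left: Expr, right: Expr) extends Expr
final case class Capture(name: String, inner: Expr) extends Expr
final case class NonCapture(inner: Expr) extends Expr

val hex  = Atom("[0-9a-f]")
val tree = Concat(Atom("id:"), Capture("uuid", Concat(hex, hex)))

// Same shape as the Rewriter above: swap an atom, and turn captures into non-capturing groups
val toyRewriter: PartialFunction[Expr, Expr] = {
  case `hex`             => Atom("[0-9A-F]")
  case Capture(_, inner) => NonCapture(inner)
}

val beforeRewrite = tree.render
val afterRewrite  = tree.map(toyRewriter).render
```

Nodes for which the partial function is not defined are kept as they are, which is what makes a partial function a convenient rewriter type.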
Other languages and tools have other regex flavors, with (sometimes subtle) differences in implementation and additional or lacking features (with respect to Java’s regex flavor). If we want to use our regexes in other flavors, we can apply some transformation to obtain compatible regexes (up to a point, the limit being unimplemented, unreplicable features).
- In .NET, \w matches all letters, including diacritics (accented letters). Thus, DotNETFlavor will transform \w (when used with μ/Word) into [a-zA-Z0-9_] to avoid unwanted surprises.
- .NET does not support possessive quantifiers. DotNETFlavor therefore changes a++ into the equivalent expression (?>a+).
- JavaScript supports neither possessive quantifiers nor atomic groups, so JavaScriptFlavor mimics a++ (or (?>a+)) with (?=(a+))\1. It is a stretch, since it adds a possibly undesired capturing group, but it’s still better than no support.
- JavaScriptFlavor will throw an IllegalArgumentException when you try to convert an expression containing a look-behind.
- DotNETFlavor (as well as the Java7Flavor) inlines the group names for capture ((?<name>expr)) and reference (\k<name>).
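Java's regex engine happens to support all three spellings mentioned above, so their equivalence can be checked with the standard library. This sketch writes the regexes by hand (it does not call any Flavor); the value names are illustrative.

```scala
val aaa = "aaa"

// Greedy a+ gives back one "a" so the trailing "a" can match
val viaGreedy = "a+a".r.findFirstIn(aaa)

// Possessive, atomic group and look-ahead emulation all refuse to give anything back
val viaPossessive = "a++a".r.findFirstIn(aaa)
val viaAtomic     = "(?>a+)a".r.findFirstIn(aaa)
val viaLookahead  = """(?=(a+))\1a""".r.findFirstIn(aaa)

// The emulation still matches when no backtracking is needed, at the cost of one extra group
val emulatedMatch =
  """(?=(a+))\1b""".r.findFirstMatchIn("aaab").map(m => (m.matched, m.groupCount))
```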
Flavors expose two main methods: .express(re: RE) and .translate(re: RE). The first one returns a Tuple2[String, List[String]], whose first element is the translated regex string and whose second is the list of group names (in order of appearance), allowing you to map them to capturing group indexes (like Scala does) if needed. The second method only performs the translation of a RE term into another.
The following flavors are bundled with REL:
For example, to express a regex in the .NET regex flavor:
val myRegex = ^^ - (α.++ \ "firstWord")
DotNETFlavor.translate(myRegex) // approximately* ^^ - (?>(α.+) \ "firstWord")
DotNETFlavor.express(myRegex)._1 === """\A(?<firstWord>(?>[a-zA-Z]+))"""
DotNETFlavor.express(myRegex)._2.toString === "List(firstWord)"
* approximately because the named capturing group will also have an inline naming strategy (for which there is no short DSL syntax, thus skipped here for the sake of simplicity)
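The name-to-index mapping mentioned for express can be sketched with the standard library. The tuple below is copied by hand from the example output above rather than produced by REL, and the mapping assumes that every capturing group has an entry in the name list; groupIndex and firstWord are illustrative names.

```scala
import scala.util.matching.Regex

// (regex string, group names in order of appearance), as in the express example
val expressed: (String, List[String]) =
  ("""\A(?<firstWord>(?>[a-zA-Z]+))""", List("firstWord"))

// Capturing group indexes are 1-based, in order of appearance
val groupIndex: Map[String, Int] =
  expressed._2.zipWithIndex.map { case (name, i) => name -> (i + 1) }.toMap

// This particular string is also valid java.util.regex syntax (Java 7+), so we can try it here
val firstWord: Option[String] =
  new Regex(expressed._1)
    .findFirstMatchIn("Hello world")
    .map(_.group(groupIndex("firstWord")))
```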
But Flavors are not limited to other regex implementations. You can define your own for various uses, e.g.:
RE
tree
Cleaner is really just a case class around a String => String function. It is meant to help pre-process text before matching; its usage is completely optional. It also holds some utility methods to ease composition and instantiation.
You create a Cleaner simply by giving it a function:
val lineBreakNormalizer = Cleaner(_.replaceAllLiterally("\r\n", "\n"))
There is a shorthand for regex replacement:
val stripMultipleDashes = Cleaner.regexReplaceAll("--+".r, "-")
A Cleaner extends Function[String, String], so you use it like any other function, either someCleaner(someString) or someCleaner.apply(someString).
The most readable/familiar form of composing is using unix-like pipes, intuitively applied from left to right:
val myCleaner = lineBreakNormalizer | TrimFilter | LowerCaseFilter
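Since a Cleaner is essentially a String => String function, the left-to-right pipe behaves like function composition with andThen. Here is a sketch with plain functions standing in for Cleaners (the three functions are illustrative stand-ins, not REL's built-in implementations).

```scala
// Plain functions standing in for lineBreakNormalizer, TrimFilter and LowerCaseFilter
val normalizeBreaks: String => String = _.replace("\r\n", "\n")
val trimText: String => String        = _.trim
val lowerText: String => String       = _.toLowerCase

// Applied from left to right, like the pipe syntax
val pipeline: String => String = normalizeBreaks andThen trimText andThen lowerText

val cleaned = pipeline("  Hello\r\nWorld ")
```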
The advantage of heavy cleaning, when you can afford it, is that you match more variations (accents, case sensitivity, double spaces…) with simpler (and possibly faster) regexes. The drawbacks are an upfront performance cost (not necessarily worse than that of a more complex/permissive regex) and, more importantly, matching on an altered text, which makes it harder to locate matches in the original text. This can be addressed by TrackString (covered later in this chapter), but at an additional performance cost.
In the built-in Cleaners, the naming convention follows these rules of thumb:
- *Normalizer normalizes variations of the same thing
- *Cleaner cleans up information that is irrelevant for the task at hand
- *Filter transforms the text to prepare it for matching
The bundled Cleaners are:
Name | Usage |
---|---|
IdentityCleaner | Utility no-op Cleaner |
CamelCaseSplitFilter | Split CamelCase words; follows the form aBc (lower-upper-lower): will split someWords but neither iOS nor VitaminC |
LowerCaseFilter | Transform the text to lowercase; lowercase-only regexes will often outperform case-insensitive ones |
LineSeparatorNormalizer | Normalize all Unicode line breaks and vertical tabs to ASCII new line U+000A / \n |
WhiteSpaceNormalizer | Normalize all Unicode spaces and horizontal tabs to ASCII spaces U+0020 |
WhiteSpaceCleaner | Replace multiple instances of regular whitespaces (\s+) by a single space (strips line breaks) |
AllWhiteSpaceCleaner | Replace multiple instances of all Unicode whitespaces by a single space (strips line breaks) |
SingleQuoteNormalizer | Normalize frequent Unicode single quote / apostrophe variations (like prime or curved apostrophe) to ASCII straight apostrophe U+0027 / ' |
DoubleQuoteNormalizer | Normalize frequent Unicode double quote variations to ASCII quotation mark U+0022 / " |
QuoteNormalizer | Combines SingleQuoteNormalizer and DoubleQuoteNormalizer |
DiacriticCleaner | Pseudo ASCII folding, removes diacritical marks (like accents) and some common Unicode variants on Latin characters |
FullwidthNormalizer | Normalize CJK Fullwidth Latin characters to their ASCII equivalents |
You can of course create your own Cleaners.
If your cleaning operation can fit in a single regex replacement:
object HtmlTagCleaner extends Cleaner(
  Cleaner.regexReplaceFirst("<html[^>]*+>(.*)</html>", "$1"))
// or val htmlTagCleaner: Cleaner = Cleaner.regexReplaceFirst(…)
Same for multiple regex replacement:
object HtmlCommentsCleaner extends Cleaner(
  Cleaner.regexReplaceAll("<!--(.*)-->", ""))
Otherwise, you can use any String => String transformation.
TrackStrings are strings that can, to a certain extent, keep track of the shifts in positions. You pass Strings through Cleaners / regex replacement and remain able to get the position (or the best estimated range) in the original string of a [group of] character[s] that have moved in the resulting string:
import fr.splayce.rel.util.TrackString
val ts = TrackString("Test - OK - passed")
  .replaceAll(" - ".r, " ") // "Test OK passed"
ts.srcPos(5, 7) // Interval [7,9) was the original position of "OK"
ts.srcPos(8, 14) // Interval [12,18) was the original position of "passed"
And Cleaners support TrackStrings, so this allows you to:
import fr.splayce.rel.cleaners.CamelCaseSplitFilter
val os = "MySuperClass"
val ts = CamelCaseSplitFilter(TrackString(os)) // "My Super Class"
val m = "Super".r.findFirstMatchIn(ts.toString).get
val op = ts.srcPos(m.start, m.end) // Interval [2,7)
val highlight = os.substring(0, op.start) +
  "<strong>" + os.substring(op.start, op.end) + "</strong>" +
  os.substring(op.end)
// "My<strong>Super</strong>Class"
Please note that, while Cleaners made with Cleaner.regexReplaceFirst and Cleaner.regexReplaceAll automatically support position tracking, you will have to implement apply: TrackString => TrackString if you implement your own cleaners, e.g. by calling TrackString.edit (see the API doc). Otherwise, the TrackString will see your string transformation as one big replacement, i.e. it will tell you that the original position of any character is somewhere between the beginning and the end of your original string, which admittedly isn’t of much help.
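To give an intuition of the bookkeeping involved in position tracking, here is a deliberately simplified toy. It is not REL's TrackString: Tracked and replaceAllTracked are hypothetical names, the replacement is treated as a literal string, and only positions lying outside the replaced spans are handled.

```scala
import scala.util.matching.Regex

// Each shift records where a replacement ended in the result,
// and how many more characters the matched source text had than its replacement
final case class Tracked(text: String, shifts: List[(Int, Int)]) {
  // Source offset of a result offset that lies outside any replaced span
  def srcOffset(pos: Int): Int =
    pos + shifts.collect { case (resEnd, delta) if resEnd <= pos => delta }.sum

  def srcPos(start: Int, end: Int): (Int, Int) = (srcOffset(start), srcOffset(end))
}

// Replace every match with a literal replacement while recording the shifts
def replaceAllTracked(source: String, re: Regex, replacement: String): Tracked = {
  val out = new StringBuilder
  var last = 0
  var shifts = List.empty[(Int, Int)]
  for (m <- re.findAllMatchIn(source)) {
    out.append(source.substring(last, m.start)) // untouched text before the match
    out.append(replacement)
    shifts = (out.length, (m.end - m.start) - replacement.length) :: shifts
    last = m.end
  }
  out.append(source.substring(last)) // untouched tail
  Tracked(out.toString, shifts)
}

val tracked   = replaceAllTracked("Test - OK - passed", " - ".r, " ")
val okPos     = tracked.srcPos(5, 7)  // "OK" in the result
val passedPos = tracked.srcPos(8, 14) // "passed" in the result
```

A transformation that is opaque to such bookkeeping can only be recorded as one replacement spanning the whole string, which is why custom cleaners need to describe their edits explicitly.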
REL version number follows the Semantic Versioning 2.0 Specification. In the current early stage of development, the API is still unstable and backward compatibility may break.
As an additional rule, in version 0.Y.Z, a Z-only version change is expected to be backward compatible with previous 0.Y.* versions. But a Y version change potentially breaks backward compatibility.
There is no representation in the DSL for specific character ranges nor raw strings.
The string primitives are not parsed (use esc(str) to escape a string that should be matched literally). Hence:
- Special characters such as +, ? or ( are kept as-is, even in RECst. Use esc(str) to escape the whole string.
- \w passed in a string (as opposed to used with Word/μ) will not be translated by the DotNETFlavor.
Group names are checked, and silently not inlined if they fail validation or if they are duplicated when the flavor requires uniqueness.
\uXXXX is not supported by PCRE, yet not translated by PCREFlavor so far.
JavaScript regexes are quite limited and work a bit differently. In JavaScript flavor:
- WordBoundary/\b is kept as-is, but will not have exactly the same semantics because of the lack of Unicode support in the JavaScript regex flavor. For instance, in "fiancé", JavaScript sees "\bfianc\bé" where most other flavors see "\bfiancé\b". The same goes for NotWordBoundary/\B.
- InputBegin (^^) and InputEnd ($$) are translated to LineBegin (^) and LineEnd ($), but this is only correct if the m (multiline) flag is off.
Not all Unicode ligatures and variations are known to DiacriticCleaner, for example:
- U+1F100-U+1F1FF (Unicode 6.1)
- U+3300-U+33FF (Unicode 6.0)
- U+A720-U+A7FF (Unicode 5.1 to 6.1)
Regex replacement in TrackString does not support Java 7 embedded group names, which are not accessible in Scala’s Match yet. It will use Scala group names instead (inconsistent with String#replaceAll).
TrackString cannot track intertwined/reordered replacements, i.e. you can only track abc => bca as a single group (as opposed to three reordered groups). If out-of-order Repl/Subst are introduced, srcPos will most probably yield incorrect results.
The following would be useful:
Core
[^...])
'symbols for group names
Matchers
Utils