Regular Expressions
The regular expression type in Lasso allows for powerful search and replace operations on strings and byte streams. This chapter details how the regular expression type works and other Lasso methods that use regular expressions.
Regular Expression Structure
A regular expression is a pattern that describes a sequence of characters that you want to search for in a target (or input) string. Regular expressions consist of letters or digits that simply match themselves, wildcards that match any character in a class such as whitespace or digits, and combining symbols that expand wildcards to match several characters rather than just one. Lasso uses the ICU Regular Expressions package for its support of regular expressions.
Note
Full documentation of regular expression methodology is outside the scope of this guide. Consult a standard reference on regular expressions for more information about how to use this flexible technology.
Basic Matchers
The simplest regular expression is just a pattern containing letters or digits.
The regular expression bird
is said to match the string “bird”. The
regular expression 123
matches the string “123”. The regular
expression is matched against an input string by comparing each character in the
regular expression to each character in the input string, one after another.
Regular expressions are normally case-sensitive so the regular expression
John
would not match the string “john”.
Unicode characters within a regular expression work the same as any other
character. The escape sequence \u2FB0
with the four-digit hex value for a
Unicode character can also be used in place of any actual character (within
regular expressions or any Lasso strings). The escape sequence \u2FB0
represents a Chinese character.
Regular expressions can also match part of a string. The regular expression
bird
is found starting at position 3 in the string “A bird in the
hand”.
A regular expression can contain wildcards that match one of a set of
characters. [Jj]
is a wildcard which matches either an uppercase “J” or a
lowercase “j”. The regular expression [Jj]ohn
will match either the
string “John” or the string “john”. The wildcard [aeiou]
matches any
lowercase vowel. The wildcard [a-z]
matches any lowercase roman letter. The
wildcard [0-9]
matches any arabic digit. The wildcard [a-zA-Z]
matches
any uppercase or lowercase roman letter. If a Unicode character is used in a
character range then any characters between the hex value for the two characters
are matched. The wildcard [\u2FB0-\u2FBF]
will match 16 different Chinese
characters.
The period (.) is a special wildcard that matches any single character. The
regular expression .. would match any two-character string including
“be”, “12”, or even “ ” (two spaces). The period will match any
ASCII or Unicode character including punctuation or most whitespace characters.
It will not match return or newline characters.
A number of other predefined wildcards are available. The predefined wildcards
are all preceded by a backslash (\).
Many of the predefined wildcards come in pairs. The wildcard \s
matches any
whitespace character including tabs, spaces, returns, or newlines. The wildcard
\S
matches any non-whitespace character. The wildcard \w
matches any
alphanumeric character or underscore. The “w” is said to stand for “word” since
these are all characters that may appear within a word. The wildcard \W
matches non-word characters. The wildcard \d
matches any arabic digit and
the wildcard \D
matches any non-digit. For example, the regular expression \w\w\w
would match any three-character word such as “cat” or “dog”.
The regular expression \d\d\d-\d\d\d-\d\d\d\d would match
a standard North American phone number in the form “360-555-1212”.
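The phone number wildcard pattern can be tried out with the string_findRegExp method covered near the end of this chapter. This is only a sketch: the input sentence is made up for illustration, and the pattern is written as a ticked string literal so the backslashes need no doubling.

```lasso
// Look for a run of 3 digits, a hyphen, 3 digits, a hyphen, and 4 digits.
// The input text here is an invented example, not taken from this guide.
string_findRegExp(
    'Call 360-555-1212 before noon.',
    -find=`\d\d\d-\d\d\d-\d\d\d\d`
)
```

The returned array is expected to hold the matched phone number, while the surrounding words are ignored because they contain no run of digits in that shape.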
The predefined wildcards only work on standard ASCII strings. There is a special
pair of wildcards \p
and \P
that allow matching different characters in
a Unicode string. The wildcard is specified as \p{Property}
. A list of
properties can be found in the table below. For example, the wildcard \p{L}
matches any Unicode letter character, the wildcard \p{N}
matches any Unicode
digit, and the wildcard \p{P}
matches any Unicode punctuation characters.
The \P{Property}
wildcard is the opposite. \P{L}
matches any Unicode
character that is not a letter.
Many characters have special meanings in regular expressions including
[ ] ( ) { } . * + ? ^ $ \ |. In order to match one of these characters
literally it is necessary to use a backslash in front of it, e.g. \[ matches a
literal opening square bracket rather than starting a character range.
It is important to remember that double- or single-quoted string literals use a
backslash for escape sequences, so a double backslash is required to use the
predefined wildcards and to escape special characters. You can avoid having to
use a double backslash by specifying the regular expression using ticked string
literals. However, the use of ticked string literals makes it difficult to match
common escape sequences such as returns (\r
) or newlines (\n
). It is
recommended that you use ticked string literals for all of your regular
expressions until you need one of these escape sequences, and then that you
concatenate in a non-ticked string literal for these sequences. For example, the
following string concatenation would create a regular expression that matches a
letter followed by a tab followed by a digit:
local(my_regexp) = `\w` + "\t" + `\d`
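As a minimal sketch of the quoting rule described above, the two literals below are intended to hold the same pattern. The variable names and the sample input are illustrative assumptions, not part of this guide.

```lasso
// Quoted literal: each backslash is doubled so it survives string escaping
local(quoted_pattern) = '\\d{3}-\\d{4}'

// Ticked literal: backslashes are taken exactly as written
local(ticked_pattern) = `\d{3}-\d{4}`

// Either value can be passed to regexp as the find pattern
regexp(-find=#ticked_pattern, -input='555-1212')->matches
regexp(-find=#quoted_pattern, -input='555-1212')->matches
```

The ticked form is usually easier to read because the pattern appears exactly as the regular expression engine will see it.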
Basic Matching Strings
Below is a listing of basic matchers and a brief definition. Matches are case-sensitive by default. Be sure to note whether quoted or ticked string literals are being used.
`.`
- Period matches any single character except a line break.
`[ ]`
- Character class. Matches any character contained between the square brackets.
`[^ ]`
- Character exception class. Matches any character that is not contained between the square brackets.
`[a-z]`
- Lowercase character range. Matches any character between the two specified.
`[A-Z]`
- Uppercase character range.
`[a-zA-Z]`
- Combination character range matching any letter.
`[0-9]`
- Numeric character range.
"\t"
- Matches a tab character.
"\r"
- Matches a return character.
"\n"
- Matches a newline character.
`"`
- Matches a double quote.
`'`
- Matches a single quote.
`\x##`
- Matches a single ISO-8859-1 character. The number signs should be replaced with the 2-digit hex value for the character.
`\u####`
- Matches a single Unicode character. The number signs should be replaced with the 4-digit hex value (code point) for the Unicode character.
`\p{ }`
- Matches a single Unicode character with the stated property. (The available properties are listed next.)
`\P{ }`
- Matches a single Unicode character that does not have the stated property. (The available properties are listed next.)
`\w`
- Matches an alphanumeric “word” character, including underscores.
`\W`
- Matches a non-alphanumeric character (whitespace or punctuation).
`\s`
- Matches a blank, whitespace character. Equivalent to [\t\n\f\r\p{Z}].
`\S`
- Matches a non-blank, non-whitespace character.
`\d`
- Matches a digit character. Equivalent to [0-9].
`\D`
- Matches a non-digit character.
`\`
- Escapes the next character. This allows any symbol to be specified as a matching character including the reserved characters [ ] ( ) { } . * + ? ^ $ \ |.
The following table lists the property symbols that can be used with the \p
and \P wildcards. The main symbol (e.g. \p{L}) will match all of the
characters that are matched by each of the variants.
Symbol | Property | Variants | Description |
---|---|---|---|
L | letter | Lu | Uppercase Letter |
L | letter | Ll | Lowercase Letter |
L | letter | Lt | Titlecase Letter |
L | letter | Lm | Modifier Letter |
L | letter | Lo | Other Letter |
N | number | Nd | Decimal Digit Number |
N | number | Nl | Letter Number |
N | number | No | Other Number |
P | punctuation character | Pc | Connector Punctuation |
P | punctuation character | Pd | Dash Punctuation |
P | punctuation character | Ps | Open Punctuation |
P | punctuation character | Pe | Close Punctuation |
P | punctuation character | Pi | Initial Punctuation |
P | punctuation character | Pf | Final Punctuation |
P | punctuation character | Po | Other Punctuation |
S | symbol | Sm | Math Symbol |
S | symbol | Sc | Currency Symbol |
S | symbol | Sk | Modifier Symbol |
S | symbol | So | Other Symbol |
Z | separator (typically whitespace) | Zs | Space Separator |
Z | separator (typically whitespace) | Zl | Line Separator |
Z | separator (typically whitespace) | Zp | Paragraph Separator |
M | mark | Mn | Non-Spacing Mark |
M | mark | Mc | Spacing Combining Mark |
M | mark | Me | Enclosing Mark |
C | “other” character | Cc | Control |
C | “other” character | Cf | Format |
C | “other” character | Cs | Surrogate |
C | “other” character | Co | Private Use |
C | “other” character | Cn | Not Assigned |
Combining Symbols
Combining symbols allow expanding wildcards to match entire substrings rather
than individual characters. For example, the wildcard [a-z]
matches one
lowercase letter and needs to be repeated three times to match a three letter
word [a-z][a-z][a-z]
. Instead, the combining symbol {3}
can be
used to specify that the preceding wildcard should be repeated three times
[a-z]{3}
.
The combining symbol +
matches one or more repetitions of the preceding
matcher. The expression [a-z]+
matches any string of lowercase letters. This
expression matches the strings “a”, “green”, or “international”. It does not
match “$1,544,897.00” because that string does not contain any lowercase
letters.
The combining symbol +
can be used with the .
wildcard to match any
string of one or more characters (.+
), with the wildcard \w
to match any
word (\w+
), or with the wildcard \s
to match one or more whitespace
characters (\s+
). The +
symbol can also be used with a simple letter to
match one or more repetitions of the letter. The regular expression
Me+t
matches both the string “Met” and the string “Meet”, not to
mention “Meeeeeet”.
The combining symbol *
matches zero or more repetitions of the preceding
matcher. The *
symbol can be used with the generic wildcard .
to match
any string of characters (.*
). The *
symbol can be used with the
whitespace wildcard \s
to match a string of whitespace characters. For
example, the expression \s*cat\s*
will match the string “cat”, but
also the string “ cat ”.
Braces are used to designate a specific number of repetitions of the preceding
wildcard. When the braces contain a single number they designate that the
preceding wildcard should be matched exactly that number of times. For example,
[a-z]{3}
matches any three lowercase letters. When the braces contain two
numbers they allow for any number of repetitions from the lower number to the
upper number. The pattern [a-z]{3,5}
matches any three to five lowercase
letters. If the second number is omitted then the braces function similarly to a
+
, e.g. [a-z]{3,}
matches any string of lowercase letters with a length
of 3 or longer.
The symbol ?
on its own makes the preceding matcher optional. For example,
the expression mee?t
will match either the string “met” or “meet”
since the second “e” is optional, but it won’t match “meeeet”.
When used after a +, *, or braces the ? makes the match non-greedy.
Normally, a subexpression will match as much of the input string as possible.
The expression <.*> will match a string that begins and ends with
angle brackets. It will match the entire string "<b>Bold Text</b>". With the
non-greedy option the expression <.*?> will match the shortest string
possible. It will now match just the first part of the string "<b>" and a
second application of the expression will match the last part of the string
"</b>".
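The difference between greedy and non-greedy matching can be sketched with the regexp type documented later in this chapter. The variable names are illustrative, and collecting matches with array->insert is an assumption about how one might gather the results rather than something this guide prescribes.

```lasso
local(html) = '<b>Bold Text</b>'

// Greedy: .* takes as much as it can, so a single match spans the whole input
local(greedy) = regexp(-find=`<.*>`, -input=#html)
local(greedy_hits) = array
while(#greedy->find) => {
    #greedy_hits->insert(#greedy->matchString)
}

// Non-greedy: .*? stops at the first closing bracket, so each tag is its own match
local(lazy) = regexp(-find=`<.*?>`, -input=#html)
local(lazy_hits) = array
while(#lazy->find) => {
    #lazy_hits->insert(#lazy->matchString)
}

#greedy_hits->size   // one match, per the description above
#lazy_hits->size     // two matches: the opening tag and the closing tag
```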
+
- Matches 1 or more repetitions of the preceding symbol.
*
- Matches 0 or more repetitions of the preceding symbol.
?
- Makes the preceding symbol optional.
{n}
- Braces. Matches “n” repetitions of the preceding symbol.
{n,}
- Matches at least “n” repetitions of the preceding symbol.
{n,m}
- Matches at least “n”, but no more than “m” repetitions of the preceding symbol.
+?
- Non-greedy variant of the plus sign; matches the shortest string possible.
*?
- Non-greedy variant of the asterisk; matches the shortest string possible.
{ }?
- Non-greedy variant of braces; matches the shortest string possible.
Groupings
Groupings have two purposes in regular expressions: they allow designating portions of a regular expression as groups that can be used in a replacement pattern, and they allow building more complex regular expressions from simple regular expressions.
Parentheses are used to designate a portion of a regular expression as a
replacement group. Most regular expressions are used to perform find/replace
operations so this is an essential part of designing a pattern. Note that if
parentheses are meant to be a literal part of the pattern then they need to be
escaped as \(
and \)
. The regular expression <b>(.*?)</b>
matches an HTML bold tag. The contents of the tag are designated as a group. If
this regular expression is applied to the string "<b>Bold Text</b>"
then the
pattern matches the entire string and “Bold Text” is designated as the first
group.
Similarly, a phone number could be matched by the regular expression
\((\d{3})\) (\d{3})-(\d{4}) with three groups. The first group
represents the area code (note that the parentheses appear in both escaped form
\( \) to match literal opening and closing parentheses and normal form ( )
to designate a grouping). The second group represents the prefix and the
third group the subscriber number. When the regular expression is applied to the
string “(360) 555-1212” then the pattern matches the entire string and generates
the groups “360”, “555”, and “1212”.
Parentheses can also be used to create a subexpression that does not generate a
replacement group using (?:)
. This form can be used to create subexpressions
that function much like very complex wildcards. For example, the expression
(?:blue)+
will match one or more repetitions of the subexpression
“blue”. It will match the strings “blue”, “blueblue” or “blueblueblueblue”.
The |
symbol can be used to specify alternation. It is most useful when used
with subexpressions. The expression (?:blue)|(?:red)
will match either
the word “blue” or the word “red”.
( )
- Grouping for output. Defines a numbered group for output. Up to nine groups can be defined.
(?: )
- Grouping without output. Can be used to create a logical grouping that should not be assigned to an output.
|
- Alternation. Matches either the characters before or the characters after the symbol. May appear within a group to limit the alternation boundary.
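The phone number groups described above can be inspected with the regexp type documented later in this chapter. This is a sketch only: the variable names are illustrative, and storing the groups in an array is an assumption made for the example.

```lasso
local(phone) = regexp(
    -find=`\((\d{3})\) (\d{3})-(\d{4})`,
    -input='(360) 555-1212'
)

// The find pattern contains three pairs of unescaped parentheses
#phone->groupCount

local(parts) = array
if(#phone->find) => {
    // Passing an integer to matchString selects one group of the current match
    #parts->insert(#phone->matchString(1))   // area code
    #parts->insert(#phone->matchString(2))   // prefix
    #parts->insert(#phone->matchString(3))   // subscriber number
}
#parts
```

The escaped parentheses match literal characters and do not count as groups, which is why only three groups are reported.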
Replacement Expressions
When regular expressions are used for find/replace operations the replacement
expression can contain placeholders into which the defined groups from the
search expression are substituted. The placeholder $0
represents the entire
matched string. The placeholders $1
through $9
represent the first nine
groupings as defined by parentheses in the regular expression.
The regular expression <b>(.*?)</b>
from above matches an HTML bold
tag with the contents of the tag designated as a group. The replacement
expression <em>$1</em>
will essentially replace the bold tags with
emphasis tags without disrupting the tags’ contents, e.g. the string
"<b>Bold Text</b>"
would become "<em>Bold Text</em>"
after such a
find/replace operation.
The phone number expression \((\d{3})\) (\d{3})-(\d{4}) from above
matches a phone number and creates three groups for the parts of the phone
number. The replacement expression $1-$2-$3 would rewrite the phone
number to be in a more standard format. For example, the string “(360) 555-1212”
would result in “360-555-1212” after a find/replace operation.
$0–$9
- Names a group in the replace string. $0 represents the entire matched string. Up to nine groups can be specified using the digits 1 through 9.
Tip
To place a literal $ in a replacement string, escape it as \$.
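The placeholders can be sketched with the positional form of the regexp creator method documented later in this chapter (find pattern, replacement pattern, input). The input strings are invented for illustration.

```lasso
// $0 is the whole match: wrap every run of digits in square brackets
regexp(`\d+`, `[$0]`, 'Order 12 ships with order 345.')->replaceAll

// $1 and $2 are the first and second groups: swap two words
regexp(`(\w+) (\w+)`, `$2 $1`, 'hello world')->replaceAll
```

Writing the replacement as a ticked string literal keeps any backslash, such as the one in \$, from being consumed by string escaping first.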
Advanced Expressions
The ICU library also supports a number of more advanced symbols for special purposes. Some of these symbols are listed in the following table, but a reference on regular expressions should be consulted for full documentation of these symbols and other advanced concepts. A list of regular expression flags follows.
(?# )
- Regular expression comment. The contents are not interpreted as part of the regular expression.
(?i)
- Sets a flag to make the remainder of the regular expression case-insensitive. Similar to specifying -ignoreCase.
(?-i)
- Sets the remainder of the regular expression to be case-sensitive (the default).
(?i: )
- The contents of this group will be matched case-insensitive and the group will not be added to the output.
(?-i: )
- The contents of this group will be matched case-sensitive and the group will not be added to the output.
(?= )
- Positive lookahead assertion. The contents are matched following the current position, but not added to the output pattern.
(?! )
- Negative lookahead assertion. The same as above, but the content must not match following the current position.
(?<= )
- Positive lookbehind assertion. The contents are matched preceding the current position, but not added to the output pattern. The length of possible strings matched by lookbehinds cannot be unbounded (no * or + operators).
(?<! )
- Negative lookbehind assertion. The same as above, but the contents must not match preceding the current position.
`\b`
- Matches the boundary between a word and a space. Does not properly interpret Unicode characters. The transition between any regular ASCII character (matched by \w) and a Unicode character is seen as a word boundary.
`\B`
- Matches a boundary not between a word and a space.
`\A`
- Matches the beginning of the input.
`\Z`
- Matches the end of the input.
`^`
- Matches the beginning of the input, or the line if the m flag is set.
`$`
- Matches the end of the input, or the line if the m flag is set.
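Two of the symbols above can be sketched with the regexp type documented later in this chapter. The inputs and variable names are illustrative assumptions.

```lasso
// (?i) makes the rest of the pattern case-insensitive,
// so the lowercase pattern is tested against an uppercase input
regexp(-find=`(?i)john`, -input='JOHN')->matches

// Positive lookahead: match a word only when a comma follows it.
// The comma is checked but is not part of the match itself.
local(before_comma) = regexp(-find=`\w+(?=,)`, -input='red, green, blue')
local(words) = array
while(#before_comma->find) => {
    #words->insert(#before_comma->matchString)
}
#words   // "blue" is left out because no comma follows it
```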
Regular Expression Flags
i
- Sets matching to be case-insensitive.
x
- Allows whitespace and comments within patterns.
s
- Allows the . character to also match line break characters.
m
- Allows the characters ^ and $ to match the start and end of lines, respectively. By default these will only match at the start and end of the input.
w
- Changes the behavior of \b so that word boundaries are defined according to Unicode Standard Annex #29.
Regexp Type
The regexp
type allows a regular expression to be defined once and then
reused many times. It facilitates simple search operations, splitting strings,
and interactive find/replace operations.
The regexp
type has some advantages over the string_…
methods that
perform regular expression operations. Performance can be increased by creating
a regular expression once and then reusing it multiple times. The type has a
number of member methods that allow access to the stored regular expressions and
input and output of strings, performing find/replace operations, or acting as
components in an interactive find/replace operation. These are described below.
Creating Regular Expression Objects
type regexp
regexp(find::string, replace::string, input::string, ignorecase::boolean)
regexp(find::string, replace::string=?, input::string=?, -ignoreCase::boolean=?)
regexp(-find::string, -replace::string=?, -input::string=?, -ignoreCase::boolean=?)
- The regexp creator method creates a reusable regular expression. A regexp object must be initialized with a string regular expression pattern as either the first parameter or as the argument of a -find keyword parameter. The type will also store a replacement pattern and input string, passed as either the second and third parameters or specified with the -replace or -input keyword parameter, respectively. These can be overridden with particular member methods. The type also has an -ignoreCase option which controls whether regular expressions are applied with case sensitivity or not.
A regular expression can be created that explicitly specifies the find pattern, replacement pattern, input string, and optionally with the -ignoreCase option. Using a fully qualified regular expression that is output to the page (rather than being stored in a variable) is an easy way to perform a quick find/replace operation.
regexp(`[aeiou]`, 'x', 'The quick brown fox jumped over the lazy dog.')->replaceAll
// => Thx qxxck brxwn fxx jxmpxd xvxr thx lxzy dxg.
However, a regular expression will usually be stored in a variable and then later run against an input string. The following code stores a regular expression with a find and replace pattern into the variable “my_regexp”. The following section Simple Find/Replace and Split Methods will show how this regular expression can be applied to strings.
local(my_regexp) = regexp(-find=`[aeiou]`, -replace=`x`, -ignoreCase)
regexp->findPattern()
- Returns the find pattern.
regexp->replacePattern()
- Returns the replacement pattern.
regexp->input()
- Returns the input string.
regexp->ignoreCase()
- Returns “true” if the -ignoreCase flag has been set, otherwise returns “false”.
regexp->groupCount()
- Returns an integer specifying how many groups were found in the find pattern.
regexp->output()
- Returns the output string.
For example, the regular expression above can be inspected by the following code. The group count is “0” since the find expression does not contain any groups (designated by parentheses):
'FindPattern: ' + #my_regexp->findPattern + '\n'
'ReplacePattern: ' + #my_regexp->replacePattern + '\n'
'IgnoreCase: ' + #my_regexp->ignoreCase + '\n'
'GroupCount: ' + #my_regexp->groupCount + '\n'
// =>
// FindPattern: [aeiou]
// ReplacePattern: x
// IgnoreCase: true
// GroupCount: 0
Simple Find/Replace and Split Methods
The regexp
type provides two member methods that perform a find/replace
on an input string and one method that splits an input string into an array.
These methods are documented with examples below, and are shortcuts for longer
operations that can be performed using the interactive methods described in the
next section.
regexp->replaceAll(replace::string)
regexp->replaceAll(-input=?, -find=?, -replace=?, -ignoreCase=?)
- The first listed form of this method allows changing the replacement string. The second will replace all occurrences of the current find pattern with the current replacement pattern. The -input parameter specifies what string should be operated on. If no input is provided then the input stored in the regular expression object is used. If desired, new -find and -replace patterns can also be specified within this method along with the -ignoreCase flag.
regexp->replaceFirst(-input=?, -find=?, -replace=?, -ignoreCase=?)
- Replaces the first occurrence of the current find pattern with the current replacement pattern. The -input parameter specifies what string should be operated on. If no input is provided then the input stored in the regular expression object is used. If desired, new -find and -replace patterns can also be specified within this method along with the -ignoreCase flag.
regexp->split(-input=?, -find=?, -replace=?, -ignoreCase=?)
- Splits the string using the regular expression as a delimiter and returns a staticarray of substrings. The -input parameter specifies what string should be operated on. If no input is provided then the input stored in the regular expression object is used. If desired, new -find and -replace patterns can also be specified within this method along with the -ignoreCase flag.
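The difference between the two find/replace methods can be sketched by running the same object against one input. The variable name is illustrative; the find and replace patterns are the ones used elsewhere in this chapter.

```lasso
local(vowels) = regexp(-find=`[aeiou]`, -replace=`x`, -ignoreCase)

// Only the first vowel, the "a" in "Lasso", is replaced
#vowels->replaceFirst(-input='Lasso Server')

// Every vowel in the input is replaced
#vowels->replaceAll(-input='Lasso Server')
```

Because the input is supplied on each call, the same object can be reused without storing an input string when it is created.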
Use the Same Regular Expression on Multiple Inputs
The same regular expression can be used on multiple inputs by first creating the
regular expression using one of the regexp
creator methods and then calling
regexp->replaceAll
with a new -input
as many times as necessary. Since the
regular expression is only created once this technique can be considerably
faster than using the string_replaceRegExp
method repeatedly.
local(my_regexp) = regexp(-find=`[aeiou]`, -replace=`x`, -ignoreCase)
#my_regexp->replaceAll(-input='The quick brown fox jumped over the lazy dog.')
#my_regexp->replaceAll(-input='Lasso Server')
// =>
// Thx qxxck brxwn fxx jxmpxd xvxr thx lxzy dxg.
// Lxssx Sxrvxr
The replace pattern can also be changed if necessary. The following code changes both the input and replace patterns each time the regular expression is used:
local(my_regexp) = regexp(-find=`[aeiou]`, -replace=`x`, -ignoreCase)
#my_regexp->replaceAll(-input='The quick brown fox jumped over the lazy dog.', -replace=`y`)
#my_regexp->replaceAll(-input='Lasso Server', -replace=`z`)
// =>
// Thy qyyck brywn fyx jympyd yvyr thy lyzy dyg.
// Lzssz Szrvzr
The replacement pattern can reference groups from the input using $1
through
$9
. The following example uses a regular expression to clean up the
formatting on a couple of telephone numbers:
local(my_regexp) = regexp(`\((\d{3})\) (\d{3})-(\d{4})`, `$1-$2-$3`)
#my_regexp->replaceAll(-input='(360) 555-1212')
#my_regexp->replaceAll(-input='(800) 555-1212')
// =>
// 360-555-1212
// 800-555-1212
Split a String Using a Regular Expression
The regexp->split
method can split a string using a regular expression as the
delimiter. This allows strings to be split into parts using sophisticated
criteria. For example, rather than splitting a string on a comma, the “and”
before the last item can be taken into account. Or, rather than splitting a
string on space, the string can be split into words taking punctuation and other
whitespace into account.
The same regular expression from the example above can split a string into substrings. In this case the string will be split on vowels, generating a staticarray with elements containing only consonants or spaces:
local(my_regexp) = regexp(-find=`[aeiou]`, -replace=`x`, -ignoreCase)
#my_regexp->split(-input='The quick brown fox jumped over the lazy dog.')
// => staticarray(Th, q, , ck br, wn f, x j, mp, d , v, r th, l, zy d, g.)
The -find
pattern can be modified in-place within the regexp->split
method
to split the string on a different regular expression. In this example the
string is split on any run of one or more non-word characters. This splits the
string into words not including any whitespace or punctuation.
#my_regexp->split(-find=`\W+`, -input='The quick brown fox jumped over the lazy dog.')
// => staticarray(The, quick, brown, fox, jumped, over, the, lazy, dog)
If the -find
expression contains groups then they will be returned in the
array in between the split elements. For example, surrounding the -find
pattern above with parentheses will result in an array of alternating word
elements and whitespace/punctuation elements.
#my_regexp->split(-find=`(\W+)`, -input='The quick brown fox jumped over the lazy dog.')
// => staticarray(The, , quick, , brown, , fox, , jumped, , over, , the, , lazy, , dog, .)
Interactive Find/Replace Methods
The regexp
type provides a collection of member methods that make
interactive find/replace operations possible. Interactive in this case means
that Lasso code can intervene in each replacement as it happens. Rather than
performing a simple one-shot find/replace like those shown in the last section,
it is possible to programmatically determine the replacement strings using
database searches or any logic.
The order of operations of an interactive find/replace operation is as follows:
1. The regular expression object is initialized with a -find pattern and -input string. In this example the find pattern will match each word in the input string in turn:
   local(my_regexp) = regexp(
      -find=`\w+`,
      -input='The quick brown fox jumped over the lazy dog.',
      -ignoreCase
   )
2. A while loop is used to advance the regular expression match with regexp->find. Each time through the loop the pattern is advanced one match forward. If there are no further matches then the method returns “false” and the loop is exited:
   while(#my_regexp->find) => {
      // ...
   }
3. Within the while loop the regexp->matchString method is used to inspect the current match. If the find pattern had groups then they could be inspected here by passing an integer parameter to regexp->matchString:
   local(match) = #my_regexp->matchString
4. The match is manipulated. For this example the match string will be reversed using the string->reverse method. This will reverse the word “lazy” to be “yzal”:
   #match->reverse
5. The modified match string is now appended to the output string using the regexp->appendReplacement method. This method will automatically append any parts of the input string that weren’t matched (the spaces between the words):
   #my_regexp->appendReplacement(#match)
6. After the while loop the regexp->appendTail method is used to append the unmatched end of the input string to the output (the period at the end of the example input):
   #my_regexp->appendTail
7. Finally, the output string from the regular expression object is displayed:
   #my_regexp->output
   // => ehT kciuq nworb xof depmuj revo eht yzal god.
This same basic order of operation is used for any interactive find/replace operation. The power of this methodology comes in the fourth step where the replacement string can be generated using any code necessary, rather than needing to be a simple replacement pattern.
regexp->find(position::integer=?)
- Advances the regular expression one match in the input string. Returns “true” if the regular expression was able to find another match, otherwise returns “false”. Defaults to checking from the start of the input string (or from the end of the most recent match), but an optional integer parameter can be passed to set the position in the input string at which to start the search.
regexp->matchString(group::integer=?)
- Returns a string containing the last pattern match. An optional integer parameter specifies a group from the find pattern to return, defaulting to returning the entire pattern match.
regexp->matchPosition(group::integer=?)
- Returns a pair containing the start position and length of the last pattern match. An optional integer parameter specifies a group from the find pattern to return, defaulting to returning information about the entire pattern match.
regexp->appendReplacement(pattern::string)
- Performs a replace operation on the current pattern match and appends the result onto the output string. Requires a single parameter specifying the replacement pattern including group placeholders $0–$9. Automatically appends any unmatched runs from the input string.
regexp->appendTail()
- The final step in an interactive find/replace operation. Appends the final unmatched run from the input string into the output string.
regexp->reset(-input=?, -find=?, -replace=?, -ignoreCase=?)
- Resets the object. If called with no parameters, the input string is set to the output string. Accepts optional -find, -replace, -input, and -ignoreCase parameters.
regexp->matches(position::integer=?)
- Returns “true” if the pattern matches the entire input string. An optional integer parameter sets the position in the input string at which to start the search.
regexp->matchesStart(position::integer=?)
- Returns “true” if the pattern matches a substring of the input string. Defaults to checking the start of the input string. An optional integer parameter sets the position in the input string at which to start the search.
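The whole-input and start-of-input tests, along with matchPosition, can be sketched as follows. The inputs and variable names are illustrative assumptions, and the comments restate the method descriptions above rather than verified output.

```lasso
local(digits) = regexp(-find=`\d+`, -input='2024 was a leap year')

// Whole-input test: the pattern would have to cover the entire string
#digits->matches

// Start-of-input test: the pattern only has to match at the beginning
#digits->matchesStart

// matchPosition reports where the most recent match was found
local(bird) = regexp(-find=`bird`, -input='A bird in the hand')
local(where) = null
if(#bird->find) => {
    #where = #bird->matchPosition   // a pair: start position and length
}
#where
```

Checking the result of regexp->find before asking for the position avoids reading match information when nothing was found.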
Perform an Interactive Find/Replace Operation
This example searches for variable names with a dollar sign in an input string and replaces them with variable values. An interactive find/replace operation is used so that the existence of each variable can be checked dynamically as the string is processed.
The string has several words replaced by variable references and each replacement is defined with a replacement word in a map.
local(my_string) = 'The quick $color fox $verb over the lazy $animal.'
local(replacements) = map(
'color' = "red",
'verb' = "soared",
'animal' = "ocelot"
)
A regular expression is initialized with the input string and a pattern that looks for words beginning with a dollar sign. The word itself is defined as a group within the find pattern. A while loop uses regexp->find to advance through all the matches in the input string. The method regexp->matchString with a parameter of “1” returns the map key for each match. If this key exists, its value is substituted into the output string using regexp->appendReplacement; otherwise, the full match is substituted back into the output string with the replacement pattern $0. Finally, any remaining unmatched portion of the input string is appended to the end of the output string using regexp->appendTail.
local(my_regexp) = regexp(-find=`\$(\w+)`, -input=#my_string, -ignoreCase)
while(#my_regexp->find) => {
   #my_regexp->appendReplacement(
      #replacements->find(#my_regexp->matchString(1)) or `$0`
   )
}
#my_regexp->appendTail
After the operation has completed the output string is displayed:
#my_regexp->output
// => The quick red fox soared over the lazy ocelot.
String Methods Taking Regular Expressions¶
The string_findRegExp and string_replaceRegExp methods can perform regular expression find and replace routines on text strings.
- string_findRegExp(input, -find::string, -ignoreCase=?)¶ Requires two parameters: a string value and a -find keyword parameter. Returns an array with each instance of the -find regular expression in the string parameter. An optional -ignoreCase parameter uses case-insensitive patterns.
- string_replaceRegExp(input, -find::string, -replace::string, -ignoreCase=?, -replaceOnlyOne=?)¶ Requires three parameters: a string value, a -find keyword parameter, and a -replace keyword parameter. Returns a string with each instance of the -find regular expression replaced by the value of the -replace string parameter. An optional -ignoreCase parameter uses case-insensitive matching, and an optional -replaceOnlyOne parameter replaces only the first pattern match.
Matching Patterns Using string_findRegExp¶
The string_findRegExp method returns an array of items that match the specified regular expression within the string. For each match, the array contains the full matched string, followed by each of the matched subexpressions in subsequent elements.
In the following example, every email address in a string is returned in an array:
string_findRegExp(
   'Send email to address@example.com.',
   -find=`\w+@\w+\.\w+`
)
// => array(address@example.com)
In the following example, every email address in a string is returned in an array and subexpressions are used to divide the username and domain name portions of the email address. The result is an array with the entire match string, then each of the subexpressions.
string_findRegExp(
   'Send email to address@example.com.',
   -find=`(\w+)@(\w+\.\w+)`
)
// => array(address@example.com, address, example.com)
In the following example, every word in the source is returned in an array. The first character of each word is separated as a subexpression. The returned array contains 18 elements: one for each of the nine words in the source string and one for the first-character subexpression of each word.
string_findRegExp(
   `The quick brown fox jumped over a lazy dog.`,
   -find=`(\w)\w*`
)
// => array(The, T, quick, q, brown, b, fox, f, jumped, j, over, o, a, a, lazy, l, dog, d)
The resulting array can be divided into two arrays using the following code. This code loops through the array (stored in result_array) and places the odd elements in the array word_array and the even elements in the array char_array.
local(word_array, char_array) = (: array, array)
local(result_array) = string_findRegExp(
   `The quick brown fox jumped over a lazy dog.`,
   -find=`(\w)\w*`
)
with key in #result_array->keys
let value = #result_array->get(#key)
do {
   if(#key % 2 == 0) => {
      #char_array->insert(#value)
   else
      #word_array->insert(#value)
   }
}
#word_array
// => array(The, quick, brown, fox, jumped, over, a, lazy, dog)
#char_array
// => array(T, q, b, f, j, o, a, l, d)
In the following example, every phone number in a string is returned in an array. The \d symbol is used to match individual digits and the {3} quantifier specifies that exactly three repetitions must be present. The parentheses are escaped as \( and \) so they aren’t treated as grouping characters.
string_findRegExp(
   'Phone (800) 555-1212 for information.',
   -find=`\(\d{3}\) \d{3}-\d{4}`
)
// => array((800) 555-1212)
In the following example, only the text contained between HTML bold tags is returned. Positive lookbehind and lookahead assertions are used to find the contents of the tags without the tags themselves. Note that the pattern between the assertions uses a non-greedy modifier.
string_findRegExp(
   'This is some <b>sample text</b>!',
   -find=`(?<=<b>).+?(?=</b>)`
)
// => array(sample text)
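The optional -ignoreCase parameter is not used in the examples above. As a minimal sketch, assuming an illustrative input string and pattern, it can be passed to string_findRegExp so that a lowercase pattern also finds the capitalized word:

```lasso
// Sketch only: the input string and pattern are illustrative values.
// With -ignoreCase, the pattern `blue` is expected to match both
// "Blue" and "blue" in the input string.
string_findRegExp(
   'Blue Lake sure is blue today.',
   -find=`blue`,
   -ignoreCase
)
// => array(Blue, blue)
```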
Replacing Values Using string_replaceRegExp¶
In the following example, every occurrence of the word “Blue” in the string is replaced by the HTML code <span style="color: blue;">Blue</span> so that the word “Blue” appears in blue on the web page. The -find parameter is specified so that either a lowercase or an uppercase “b” will be matched. The -replace parameter references $1 to insert the actual value matched into the output.
string_replaceRegExp(
   'Blue Lake sure is blue today.',
   -find=`([Bb]lue)`,
   -replace=`<span style="color: blue;">$1</span>`
)
// => <span style="color: blue;">Blue</span> Lake sure is <span style="color: blue;">blue</span> today.
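The optional parameters of string_replaceRegExp can be sketched on the same input. The first call below is a variant that uses -ignoreCase in place of the [Bb] character class, and the second adds -replaceOnlyOne so that only the first match is replaced. Both calls are illustrative sketches based on the parameter descriptions in this section.

```lasso
// Sketch only: a case-insensitive variant of the example above.
string_replaceRegExp(
   'Blue Lake sure is blue today.',
   -find=`(blue)`,
   -replace=`<span style="color: blue;">$1</span>`,
   -ignoreCase
)
// => <span style="color: blue;">Blue</span> Lake sure is <span style="color: blue;">blue</span> today.

// Sketch only: -replaceOnlyOne limits the replacement to the first match.
string_replaceRegExp(
   'Blue Lake sure is blue today.',
   -find=`([Bb]lue)`,
   -replace=`<span style="color: blue;">$1</span>`,
   -replaceOnlyOne
)
// => <span style="color: blue;">Blue</span> Lake sure is blue today.
```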
In the following example, every email address is replaced by an HTML anchor tag that links to the same email address. The \w symbol is used to match any alphanumeric characters or underscores. The at sign (@) matches itself. The period must be escaped (\.) in order to match an actual period and not just any character. This pattern matches any email address of the format name@example.com:
string_replaceRegExp(
   'Send email to address@example.com.',
   -find=`(\w+@\w+\.\w+)`,
   -replace=`<a href="mailto:$1">$1</a>`
)
// => Send email to <a href="mailto:address@example.com">address@example.com</a>.