regex {base} | R Documentation |
This help page documents the regular expression patterns supported by grep and related functions regexpr, gregexpr, sub and gsub, as well as by strsplit.
A ‘regular expression’ is a pattern that describes a set of strings. Three types of regular expressions are used in R: extended regular expressions, used by grep(extended = TRUE) (its default); basic regular expressions, as used by grep(extended = FALSE); and Perl-like regular expressions, used by grep(perl = TRUE).
Other functions which use regular expressions (often via the use of grep) include apropos, browseEnv, help.search, list.files, ls and strsplit. These will all use extended regular expressions, unless strsplit is called with argument extended = FALSE or perl = TRUE.
Patterns are described here as they would be printed by cat: do remember that backslashes need to be doubled in entering R character strings from the keyboard.
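The doubling of backslashes can be seen in a small sketch. The variable names (pat, n, hits) are illustrative only; the calls leave out the extended argument and so rely on the default (extended) syntax.

```r
# The regular expression for a literal period is \. ; in R source the
# backslash itself must be escaped, so the string is typed as "\\.".
pat <- "\\."
cat(pat, "\n")        # shows the pattern as the regex engine receives it: \.
n <- nchar(pat)       # 2 characters: one backslash and one period

# Only the first string contains a literal period.
hits <- grep(pat, c("a.b", "ab"), value = TRUE)   # "a.b"
```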
This section covers the regular expressions allowed if extended = TRUE in grep, regexpr, gregexpr, sub, gsub and strsplit. They use the glibc 2.3.5 implementation of the POSIX 1003.2 standard.
Regular expressions are constructed analogously to arithmetic expressions, by using various operators to combine smaller expressions.
The fundamental building blocks are the regular expressions that match a single character. Most characters, including all letters and digits, are regular expressions that match themselves. Any metacharacter with special meaning may be quoted by preceding it with a backslash. The metacharacters are . \ | ( ) [ { ^ $ * + ?
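As a hedged illustration of quoting, the following sketch compares quoted and unquoted metacharacters. The test strings and variable names are made up for the example.

```r
x <- c("1+1", "11", "a.b", "axb")

# Unquoted, + is a repetition operator: "1+1" means one or more 1s
# followed by another 1, so it matches "11" but not the string "1+1".
unquoted <- grep("1+1", x, value = TRUE)      # "11"

# Quoted with a backslash (doubled in the R string), + matches itself.
quoted <- grep("1\\+1", x, value = TRUE)      # "1+1"

# The same holds for the period.
dot_any <- grep("a.b", x, value = TRUE)       # "a.b" "axb"
dot_lit <- grep("a\\.b", x, value = TRUE)     # "a.b"
```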
A character class is a list of characters enclosed by [ and ] which matches any single character in that list; if the first character of the list is the caret ^, then it matches any character not in the list. For example, the regular expression [0123456789] matches any single digit, and [^abc] matches anything except the characters a, b or c. A range of characters may be specified by giving the first and last characters, separated by a hyphen. (Character ranges are interpreted in the collation order of the current locale.)
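A minimal sketch of lists, negated lists and ranges follows. The inputs are invented for illustration, and the range 0-9 is assumed to cover exactly the ten digits in the locale in use.

```r
x <- c("abc", "room 101", "xyz")

# Strings containing at least one digit.
with_digit <- grep("[0123456789]", x, value = TRUE)    # "room 101"

# Strings containing at least one character other than a, b or c.
not_abc <- grep("[^abc]", x, value = TRUE)             # "room 101" "xyz"

# A negated range: delete everything that is not a digit.
digits_only <- gsub("[^0-9]", "", "tel: 555-0199")     # "5550199"
```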
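Certain named classes of characters are predefined. Their interpretation depends on the locale (see locales); the interpretation below is that of the POSIX locale.
[:alnum:]
    Alphanumeric characters: [:alpha:] and [:digit:].
[:alpha:]
    Alphabetic characters: [:lower:] and [:upper:].
[:blank:]
    Blank characters: space and tab.
[:cntrl:]
    Control characters. In ASCII, these characters have octal codes 000 through 037, and 177 (DEL). In another character set, these are the equivalent characters, if any.
[:digit:]
    Digits: 0 1 2 3 4 5 6 7 8 9.
[:graph:]
    Graphical characters: [:alnum:] and [:punct:].
[:lower:]
    Lower-case letters in the current locale.
[:print:]
    Printable characters: [:alnum:], [:punct:] and space.
[:punct:]
    Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
[:space:]
    Space characters: tab, newline, vertical tab, form feed, carriage return, and space.
[:upper:]
    Upper-case letters in the current locale.
[:xdigit:]
    Hexadecimal digits: 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f.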
For example, [[:alnum:]] means [0-9A-Za-z], except the latter depends upon the locale and the character encoding, whereas the former is independent of locale and character set. (Note that the brackets in these class names are part of the symbolic names, and must be included in addition to the brackets delimiting the bracket list.)
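The named classes might be used as in the sketch below. Note the double brackets in every pattern; the inputs are ASCII-only and the variable names are illustrative.

```r
# Remove punctuation.
no_punct <- gsub("[[:punct:]]", "", "Hello, world!")       # "Hello world"

# Collapse any run of space characters (space, tab, newline) to one space.
one_space <- gsub("[[:space:]]+", " ", "a \t\n b")         # "a b"

# Keep strings made up entirely of hexadecimal digits.
hex_only <- grep("^[[:xdigit:]]+$", c("ff00", "xyz", "12G"), value = TRUE)   # "ff00"

# A named class can share a bracket list with other characters:
# a letter or underscore, followed by letters, digits or underscores.
ident <- grep("^[[:alpha:]_][[:alnum:]_]*$", c("x_1", "1x", "_tmp"),
              value = TRUE)                                # "x_1" "_tmp"
```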
Most metacharacters lose their special meaning inside lists. To include a literal ], place it first in the list. Similarly, to include a literal ^, place it anywhere but first. Finally, to include a literal -, place it first or last. (Only these and \ remain special inside character classes.)
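These placement rules can be tried out as follows. This is a sketch with made-up inputs, not an exhaustive test of every metacharacter.

```r
x <- c("]", "^", "-", "a", "b")

# ] placed first in the list is literal.
has_bracket <- grep("[]a]", x, value = TRUE)    # "]" "a"

# ^ placed anywhere but first is literal.
has_caret <- grep("[a^]", x, value = TRUE)      # "^" "a"

# - placed last (or first) is literal.
has_hyphen <- grep("[a-]", x, value = TRUE)     # "-" "a"

# . * and + are ordinary characters inside a list.
underscored <- gsub("[.*+]", "_", "a.b*c+d")    # "a_b_c_d"
```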
The period . matches any single character. The symbol \w is documented to be a synonym for [[:alnum:]] and \W is its negation. However, \w also matches underscore in the GNU grep code used in R.
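A short sketch of the period and of \W, including the underscore behaviour just described; the inputs are illustrative.

```r
# The period stands for exactly one character of any kind.
any_char <- grep("a.c", c("abc", "a_c", "ac"), value = TRUE)   # "abc" "a_c"

# \W removes the hyphen, space and exclamation mark, but the underscore
# survives because \w matches it.
word_only <- gsub("\\W", "", "a_b-c d!")                       # "a_bcd"
```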
The caret ^ and the dollar sign $ are metacharacters that respectively match the empty string at the beginning and end of a line. The symbols \< and \> respectively match the empty string at the beginning and end of a word. The symbol \b matches the empty string at the edge of a word, and \B matches the empty string provided it is not at the edge of a word.
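The anchors can be illustrated with a small sketch. The strings and variable names are invented for the example.

```r
x <- c("abc", "cab", "ab")
starts <- grep("^ab", x, value = TRUE)     # "abc" "ab"
ends   <- grep("ab$", x, value = TRUE)     # "cab" "ab"
whole  <- grep("^ab$", x, value = TRUE)    # "ab"

y <- c("the cat sat", "concatenate")
# "cat" as a whole word.
as_word <- grep("\\bcat\\b", y, value = TRUE)    # "the cat sat"
# "cat" at the start of a word.
word_start <- grep("\\<cat", y, value = TRUE)    # "the cat sat"
# "cat" starting inside a word.
inside <- grep("\\Bcat", y, value = TRUE)        # "concatenate"
```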
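A regular expression may be followed by one of several repetition quantifiers:
?
    The preceding item is optional and will be matched at most once.
*
    The preceding item will be matched zero or more times.
+
    The preceding item will be matched one or more times.
{n}
    The preceding item is matched exactly n times.
{n,}
    The preceding item is matched n or more times.
{n,m}
    The preceding item is matched at least n times, but not more than m times.
Repetition is greedy, so the maximal possible number of repeats is used.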
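Each quantifier is applied to the letter b in the sketch below, with the pattern anchored at both ends so that the counts are exact. The last line shows greedy matching; all names are illustrative.

```r
x <- c("ac", "abc", "abbc", "abbbc")

opt      <- grep("^ab?c$", x, value = TRUE)       # "ac" "abc"
star     <- grep("^ab*c$", x, value = TRUE)       # all four strings
plus     <- grep("^ab+c$", x, value = TRUE)       # "abc" "abbc" "abbbc"
exact2   <- grep("^ab{2}c$", x, value = TRUE)     # "abbc"
atleast2 <- grep("^ab{2,}c$", x, value = TRUE)    # "abbc" "abbbc"
range12  <- grep("^ab{1,2}c$", x, value = TRUE)   # "abc" "abbc"

# Greedy: .* runs to the last > it can reach, so the whole string is removed.
greedy <- sub("<.*>", "", "<a>text<b>")           # ""
```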
Two regular expressions may be concatenated; the resulting regular expression matches any string formed by concatenating two substrings that respectively match the concatenated subexpressions.
Two regular expressions may be joined by the infix operator |; the resulting regular expression matches any string matching either subexpression. For example, abba|cde matches either the string abba or the string cde. Note that alternation does not work inside character classes, where | has its literal meaning.
Repetition takes precedence over concatenation, which in turn takes precedence over alternation. A whole subexpression may be enclosed in parentheses to override these precedence rules.
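Alternation, its precedence relative to concatenation, and the literal | inside a character class can be sketched as follows (inputs are illustrative).

```r
x <- c("abba", "cde", "abbacde")
either <- grep("^(abba|cde)$", x, value = TRUE)    # "abba" "cde"

y <- c("abd", "acd", "xab", "cdx")
# Concatenation binds more tightly than |, so ab|cd means "ab" or "cd".
flat <- grep("ab|cd", y, value = TRUE)             # all four strings
# Parentheses restrict the alternation to the middle character.
grouped <- grep("a(b|c)d", y, value = TRUE)        # "abd" "acd"

# Inside a character class | is just a character.
bar <- grep("[a|b]", c("|", "c"), value = TRUE)    # "|"
```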
The backreference \N, where N is a single digit, matches the substring previously matched by the Nth parenthesized subexpression of the regular expression.
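A hedged sketch of backreferences within a pattern; the word lists are made up for the example.

```r
# A letter immediately followed by the same letter.
doubled <- grep("([[:alpha:]])\\1", c("letter", "word", "book"),
                value = TRUE)                      # "letter" "book"

# Strings whose first and last characters are identical.
same_ends <- grep("^(.).*\\1$", c("abca", "abcd"), value = TRUE)   # "abca"
```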
Before R 2.1.0, R attempted to support traditional usage by assuming that { is not special if it would be the start of an invalid interval specification. (POSIX allows this behaviour as an extension but we no longer support it.)
This section covers the regular expressions allowed if extended = FALSE in grep, regexpr, gregexpr, sub, gsub and strsplit.
In basic regular expressions the metacharacters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \). Thus the metacharacters are . \ [ ^ $ *
The perl = TRUE argument to grep, regexpr, gregexpr, sub, gsub and strsplit switches to the PCRE library that ‘implements regular expression pattern matching using the same syntax and semantics as Perl 5.6 or later, with just a few differences’.
For complete details please consult the man pages for PCRE (especially man pcrepattern and man pcreapi) on your system or from the sources at ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/. If PCRE support was compiled from the sources within R, the PCRE version is 6.2 as described here (version >= 4.0 is required even if R is configured to use the system's PCRE library).
All the regular expressions described for extended regular expressions are accepted except \< and \>: in Perl all backslashed metacharacters are alphanumeric and backslashed symbols always are interpreted as a literal character. { is not special if it would be the start of an invalid interval specification. There can be more than 9 backreferences.
The construct (?...) is used for Perl extensions in a variety of ways depending on what immediately follows the ?.
Perl-like matching can work in several modes, set by the options (?i) (caseless, equivalent to Perl's /i), (?m) (multiline, equivalent to Perl's /m), (?s) (single line, so a dot matches all characters, even new lines: equivalent to Perl's /s) and (?x) (extended, whitespace data characters are ignored unless escaped and comments are allowed: equivalent to Perl's /x). These can be concatenated, so for example, (?im) sets caseless multiline matching. It is also possible to unset these options by preceding the letter with a hyphen, and to combine setting and unsetting such as (?im-sx). These settings can be applied within patterns, and then apply to the remainder of the pattern. Additional options not in Perl include (?U) to set ‘ungreedy’ mode (so matching is minimal unless ? is used, when it is greedy). Initially none of these options are set.
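The effect of the mode options can be sketched as below. The two-line subject string and the variable names are illustrative; every call passes perl = TRUE.

```r
x <- "Line one\nline two"

# (?i): caseless matching.
caseless <- sub("(?i)LINE", "X", x, perl = TRUE)        # "X one\nline two"

# (?m): ^ also matches just after a newline (matching stays case-sensitive).
multiline <- gsub("(?m)^line", "X", x, perl = TRUE)     # "Line one\nX two"

# Options can be concatenated.
both <- gsub("(?im)^line", "X", x, perl = TRUE)         # "X one\nX two"

# (?s): a dot also matches the newline.
dot_default <- grep("one.line", x, perl = TRUE, value = TRUE)      # character(0)
dot_all     <- grep("(?s)one.line", x, perl = TRUE, value = TRUE)  # the whole of x

# (?U): repetition becomes minimal.
ungreedy <- sub("(?U)o.*e", "X", "one tone", perl = TRUE)   # "X tone"
```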
If you want to remove the special meaning from a sequence of characters, you can do so by putting them between \Q and \E. This is different from Perl in that $ and @ are handled as literals in \Q...\E sequences in PCRE, whereas in Perl, $ and @ cause variable interpolation.
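A brief sketch of \Q...\E follows; the price strings are invented for the example.

```r
x <- c("price: $5.00", "price: $5x00")

# Between \Q and \E the $ and . are taken literally.
literal <- grep("\\Q$5.00\\E", x, perl = TRUE, value = TRUE)   # "price: $5.00"

replaced <- sub("\\Q$5.00\\E", "FREE", x[1], perl = TRUE)      # "price: FREE"
```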
The escape sequences \d, \s and \w represent any decimal digit, space character and ‘word’ character (letter, digit or underscore in the current locale) respectively, and their upper-case versions represent their negation. Unlike POSIX and earlier versions of Perl and PCRE, vertical tab is not regarded as a whitespace character.
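These escapes might be used as in the sketch below, which sticks to ASCII input with spaces and tabs only (so it does not depend on how vertical tab is treated).

```r
masked    <- gsub("\\d", "#", "a1b22", perl = TRUE)           # "a#b##"
squeezed  <- gsub("\\s+", " ", "a  \t b", perl = TRUE)        # "a b"
word_only <- gsub("\\W", "", "a_b-c d!", perl = TRUE)         # "a_bcd"
digits    <- gsub("\\D", "", "tel: 555-0199", perl = TRUE)    # "5550199"
```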
Escape sequence \a is BEL, \e is ESC, \f is FF, \n is LF, \r is CR and \t is TAB. In addition \cx is cntrl-x for any x, \ddd is the octal character ddd (for up to three digits unless interpretable as a backreference), and \xhh specifies a character in hex.
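The sketch below writes a tab character in four ways (TAB is control-I, octal 011 and hex 09). The subject string is illustrative.

```r
x <- "col1\tcol2"

by_name  <- sub("\\t", ",", x, perl = TRUE)      # "col1,col2"
by_hex   <- sub("\\x09", ",", x, perl = TRUE)    # "col1,col2"
by_octal <- sub("\\011", ",", x, perl = TRUE)    # "col1,col2"
by_ctrl  <- sub("\\cI", ",", x, perl = TRUE)     # "col1,col2"
```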
Outside a character class, \b matches a word boundary, \B is its negation, \A matches at start of a subject (even in multiline mode, unlike ^), \Z matches at end of a subject or before newline at end, \z matches at end of a subject, and \G matches at first matching position in a subject. \C matches a single byte, including a newline.
The same repetition quantifiers as extended POSIX are supported. However, if a quantifier is followed by ?, the match is ‘ungreedy’, that is as short as possible rather than as long as possible (unless the meanings are reversed by the (?U) option).
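Greedy and ungreedy repetition, and the reversal caused by (?U), can be compared in a small sketch (the subject string is illustrative).

```r
x <- "<a>text<b>"

greedy   <- sub("<.+>", "", x, perl = TRUE)        # ""        (runs to the last >)
ungreedy <- sub("<.+?>", "", x, perl = TRUE)       # "text<b>" (stops at the first >)

# With (?U) the meanings are swapped.
u_plain    <- sub("(?U)<.+>", "", x, perl = TRUE)    # "text<b>"
u_question <- sub("(?U)<.+?>", "", x, perl = TRUE)   # ""
```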
The sequence (?# marks the start of a comment which continues up to the next closing parenthesis. Nested parentheses are not permitted. The characters that make up a comment play no part at all in the pattern matching.
If the extended option is set, an unescaped # character outside a character class introduces a comment that continues up to the next newline character in the pattern.
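Both comment styles are shown in the sketch below. The multi-line pattern is assembled with paste purely for readability, and the names are illustrative.

```r
# An inline (?# ...) comment is ignored by the matcher.
inline <- grep("ab(?# the comment)c", c("abc", "ab"), perl = TRUE,
               value = TRUE)                              # "abc"

# With (?x), whitespace in the pattern is ignored and # starts a comment
# that runs to the end of that pattern line.
pat <- paste("(?x)",
             "\\d+   # leading digits",
             "-      # a literal hyphen",
             "\\d+   # trailing digits",
             sep = "\n")
commented <- grep(pat, c("12-34", "12 34"), perl = TRUE, value = TRUE)   # "12-34"
```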
The pattern (?:...) groups characters just as parentheses do but does not make a backreference.
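In the sketch below, the (?:...) group is not numbered, so \1 refers to the first ordinary parenthesized group. Using \1 in the replacement string of sub is standard behaviour of sub rather than something described on this page.

```r
# (?:ab)+ groups "ab" for repetition; (\d+) is therefore group 1.
digits <- sub("(?:ab)+(\\d+)", "\\1", "abab42", perl = TRUE)     # "42"

# Within the pattern, \1 refers to (c), not to (?:a|b).
repeat_c <- grep("(?:a|b)(c)\\1", c("acc", "aca"), perl = TRUE,
                 value = TRUE)                                   # "acc"
```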
Patterns (?=...) and (?!...) are zero-width positive and negative lookahead assertions: they match if an attempt to match the ... forward from the current position would succeed (or not), but use up no characters in the string being processed.
Patterns (?<=...) and (?<!...) are the lookbehind equivalents: they do not allow repetition quantifiers nor \C in ....
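The four assertions might be used as in this sketch. Because the assertions consume no characters, the text they test is left in place in each result; the inputs are illustrative.

```r
x <- "foobar foobaz"
# "foo" only when followed by "bar"; "bar" itself is not replaced.
pos_ahead <- gsub("foo(?=bar)", "X", x, perl = TRUE)      # "Xbar foobaz"
# "foo" only when not followed by "bar".
neg_ahead <- gsub("foo(?!bar)", "X", x, perl = TRUE)      # "foobar Xbaz"

y <- "$100 and 200"
# Digits preceded by a dollar sign.
pos_behind <- gsub("(?<=\\$)\\d+", "N", y, perl = TRUE)       # "$N and 200"
# Digits preceded by neither a digit nor a dollar sign.
neg_behind <- gsub("(?<![\\d$])\\d+", "N", y, perl = TRUE)    # "$100 and N"
```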
Named subpatterns, atomic grouping, possessive quantifiers and conditional and recursive patterns are not covered here.
Prior to R 2.1.0 the implementation used was that of GNU grep 2.4.2: as from R 2.1.0 it is that of glibc 2.3.x. The latter is more strictly compliant and rejects some extensions that used to be allowed.
The change was made both because bugs were becoming apparent in the previous code and to allow support of multibyte character sets.
This help page is based on the documentation of GNU grep 2.4.2 (from which the C code used by R used to be taken), the pcre man page from PCRE 3.9 and the pcrepattern man page from PCRE 4.4.
See also grep, apropos, browseEnv, help.search, list.files, ls and strsplit.
http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html