RE2: A C++ Regular Expression Library in Practice

Introduction to RE2

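RE2 is an efficient, principled regular expression library, implemented in C++ by Rob Pike and Russ Cox of Google. Both are also leading figures behind the Go language, and Go's regexp package is a Go implementation of RE2.

RE2 is a fast, safe, thread-friendly alternative to the backtracking regular expression engines used in PCRE, Perl, and Python. RE2 supports Linux and most Unix platforms but not Windows (if necessary, you can hack it yourself).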

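Features of RE2

Backtracking engines are typically full of features and convenient syntactic sugar, but even small inputs can force them into exponential running time. RE2 uses automata theory to guarantee that a regular expression search runs in time linear in the size of the input. RE2 implements memory limits, so a search can be constrained to a fixed amount of memory. RE2 is engineered to use a small, fixed C++ stack footprint no matter what input or regular expression it must process, which makes it useful in multithreaded environments where thread stacks cannot grow arbitrarily large.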

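On large inputs, RE2 is often much faster than backtracking engines. Its use of automata theory lets it apply optimizations that the other engines cannot.

Unlike most automata-based engines, RE2 implements almost all of the common Perl and PCRE features and syntactic sugar. It finds the leftmost-first match, the same match Perl would find, and it can return submatch information. The one significant exception is that RE2 drops support for backreferences¹ and generalized zero-width assertions, because they cannot be implemented efficiently.

For users who want a simpler syntax, RE2 has a POSIX mode that accepts only the POSIX egrep operators and implements leftmost-longest overall matching.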

¹ Technical note: there's a difference between submatches and backreferences. Submatches let you find out what certain subexpressions matched after the match is over, so that you can find out, after matching dogcat against (cat|dog)(cat|dog), that \1 is dog and \2 is cat. Backreferences let you use those subexpressions during the match, so that (cat|dog)\1 matches catcat and dogdog but not catdog or dogcat.

RE2 supports submatch extraction, but not backreferences.

If you absolutely need backreferences and generalized assertions, which RE2 does not support, take a look at irregexp, Google Chrome's regular expression engine.

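Getting started with RE2

Installation

You can download a release tarball, unpack it, and install from there. This section describes another way to install.

You need Mercurial SCM and a C++ compiler (g++ or a compatible clone) installed.

Download the code and install it: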

hg clone http://re2.googlecode.com/hg re2
cd re2
make test
sudo make install
make testinstall

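On BSD systems, use gmake instead of make.

Using the RE2 library

To build a C++ application with RE2, include the re2/re2.h header in your code and link with -lre2, plus -lpthread in multithreaded environments.

Syntax

In POSIX mode, RE2 accepts standard POSIX (egrep) regular expression syntax. In Perl mode, RE2 accepts most Perl operators. The only ones excluded are those that require backtracking (with its potential for exponential running time) to implement. These include backreferences (submatches are still supported) and generalized assertions. RE2 defaults to Perl mode.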

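C++ high-level interface

There are two basic operations:

RE2::FullMatch: requires the regexp to match the entire input text.
RE2::PartialMatch: looks for a match within a substring of the input text. In POSIX mode it returns the leftmost-longest match; in Perl mode it returns the same match Perl would choose (leftmost-first).

For example: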

vi re2_high_interface_test.cc


#include <re2/re2.h>
#include <iostream>
#include <assert.h>

int
main(void)
{
    // FullMatch must cover the whole input.
    assert(RE2::FullMatch("hello", "h.*o"));
    assert(!RE2::FullMatch("hello", "e"));

    // PartialMatch only needs to match some substring.
    assert(RE2::PartialMatch("hello", "h.*o"));
    assert(RE2::PartialMatch("hello", "e"));

    std::cout << "Ok" << std::endl;
    return 0;
}

Compile the program:

 g++ -o re2_high_interface_test re2_high_interface_test.cc -lre2

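Run re2_high_interface_test. The program runs normally and prints Ok.

Submatch extraction

Both matching functions accept additional arguments that say where to store the submatches. Each argument can be a pointer to a string, to an integer type, or to a StringPiece. A StringPiece is a pointer into the original input text together with a length. It behaves somewhat like a string but does not carry its own storage. As with any pointer, you must be careful with a StringPiece: do not use it once the original text has been deleted or has gone out of scope.

Example: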

vi re2_submatch_ex_test.cc


#include <re2/re2.h>
#include <iostream>
#include <string>
#include <assert.h>

int
main(void)
{
    int i;
    std::string s;

    assert(RE2::FullMatch("ruby:1234", "(\\w+):(\\d+)", &s, &i));
    assert(s == "ruby");
    assert(i == 1234);

    // Fails: "ruby" cannot be parsed as an integer.
    assert(!RE2::FullMatch("ruby", "(.+)", &i));

    // Success; does not extract the number.
    assert(RE2::FullMatch("ruby:1234", "(\\w+):(\\d+)", &s));

    // Success; skips NULL argument.
    assert(RE2::FullMatch("ruby:1234", "(\\w+):(\\d+)", (void*)NULL, &i));

    // Fails: integer overflow keeps value from being stored in i.
    assert(!RE2::FullMatch("ruby:123456789123", "(\\w+):(\\d+)", &s, &i));

    std::cout << "Ok" << std::endl;
    return 0;
}

g++ -o re2_submatch_ex_test re2_submatch_ex_test.cc -lre2

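Pre-compiled regular expressions

The examples above compile the regular expression on every call. Instead, you can compile it once, store it in an RE2 object, and reuse that object on each call.

Example: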

vi re2_prec_re_test.cc


#include <re2/re2.h>
#include <iostream>
#include <string>
#include <assert.h>

int
main(void)
{
    int i;
    std::string s;

    // Compile once, then reuse the object for every match.
    RE2 re("(\\w+):(\\d+)");
    assert(re.ok());  // compiled; if not, see re.error();

    assert(RE2::FullMatch("ruby:1234", re, &s, &i));
    assert(RE2::FullMatch("ruby:1234", re, &s));
    assert(RE2::FullMatch("ruby:1234", re, (void*)NULL, &i));
    assert(!RE2::FullMatch("ruby:123456789123", re, &s, &i));

    std::cout << "Ok" << std::endl;
    return 0;
}

g++ -o re2_prec_re_test re2_prec_re_test.cc -lre2

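Options

The RE2 constructor takes an optional second argument that changes RE2's default options. For example, the predefined Quiet option suppresses the error message normally printed when a regular expression fails to parse: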

vi re2_options_test.cc


#include <re2/re2.h>
#include <iostream>
#include <assert.h>

int
main(void)
{
    RE2 re("(ab", RE2::Quiet);  // don't write to stderr for parser failure
    assert(!re.ok());  // can check re.error() for details

    std::cout << "Ok" << std::endl;
    return 0;
}

Compile the program:

g++ -o re2_options_test re2_options_test.cc -lre2

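Other useful predefined options are Latin1 (disable UTF-8) and POSIX (use POSIX syntax and leftmost-longest matching).

You can also define your own RE2::Options object and configure it as you like. All of the options are listed in re2/re2.h.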

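Unicode normalization

RE2 operates on Unicode code points and makes no attempt at normalization. For example, the regular expression /ü/ (U+00FC, u with diaeresis) does not match the input "ü" (U+0075 U+0308, u followed by a combining diaeresis). Normalization is a long, involved topic. The simplest solution, if you need such matches, is to normalize both the regular expression and the input in a processing step before handing them to RE2. For more details on the topic, see http://www.unicode.org/reports/tr15/.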

Additional tips and tricks

For advanced usage, such as constructing your own argument lists, using RE2 as a lexer, or parsing hex, octal, and C-radix numbers, see re2.h.

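Backtracking vs. non-backtracking

The following slides are taken from the talk "sregex: matching Perl 5 regexes on data streams".

[Slide: what backtracking means]

[Slide: the backtracking implementation]

[Slide: Rob Pike's algorithm]

[Slide: Thompson's construction algorithm]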

RE2 wrappers

An Inferno wrapper is at http://code.google.com/p/inferno-re2/.

A Python wrapper is at http://github.com/facebook/pyre2/.

A Ruby wrapper is at http://github.com/axic/rre2/.

An Erlang wrapper is at http://github.com/tuncer/re2/.

A Perl wrapper is at http://search.cpan.org/~dgl/re-engine-RE2-0.05/lib/re/engine/RE2.pm.

An Eiffel wrapper is at http://sourceforge.net/projects/eiffelre2/.

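Syntax supported by RE2

This section lists the regular expression syntax that RE2 accepts, along with syntax accepted by PCRE, Perl, and Vim. Syntax that RE2 does not support is marked (NOT SUPPORTED).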


Single characters:
`.`	any character, including newline (s=true)
`[xyz]`	character class
`[^xyz]`	negated character class
`\d`	Perl character class
`\D`	negated Perl character class
`[:alpha:]`	ASCII character class
`[:^alpha:]`	negated ASCII character class
`\pN`	Unicode character class (one-letter name)
`\p{Greek}`	Unicode character class
`\PN`	negated Unicode character class (one-letter name)
`\P{Greek}`	negated Unicode character class

Composites:
`xy`	`x` followed by `y`
`x|y`	`x` or `y` (prefer `x`)

Repetitions:
`x*`	zero or more `x`, prefer more
`x+`	one or more `x`, prefer more
`x?`	zero or one `x`, prefer one
`x{n,m}`	`n` or `n`+1 or ... or `m` `x`, prefer more
`x{n,}`	`n` or more `x`, prefer more
`x{n}`	exactly `n` `x`
`x*?`	zero or more `x`, prefer fewer
`x+?`	one or more `x`, prefer fewer
`x??`	zero or one `x`, prefer zero
`x{n,m}?`	`n` or `n`+1 or ... or `m` `x`, prefer fewer
`x{n,}?`	`n` or more `x`, prefer fewer
`x{n}?`	exactly `n` `x`
`x{}`	(≡ `x*`) (NOT SUPPORTED) VIM
`x{-}`	(≡ `x*?`) (NOT SUPPORTED) VIM
`x{-n}`	(≡ `x{n}?`) (NOT SUPPORTED) VIM
`x=`	(≡ `x?`) (NOT SUPPORTED) VIM

Possessive repetitions:
`x*+`	zero or more `x`, possessive (NOT SUPPORTED)
`x++`	one or more `x`, possessive (NOT SUPPORTED)
`x?+`	zero or one `x`, possessive (NOT SUPPORTED)
`x{n,m}+`	`n` or ... or `m` `x`, possessive (NOT SUPPORTED)
`x{n,}+`	`n` or more `x`, possessive (NOT SUPPORTED)
`x{n}+`	exactly `n` `x`, possessive (NOT SUPPORTED)

Grouping:
`(re)`	numbered capturing group
`(?P<name>re)`	named & numbered capturing group
`(?<name>re)`	named & numbered capturing group (NOT SUPPORTED)
`(?'name're)`	named & numbered capturing group (NOT SUPPORTED)
`(?:re)`	non-capturing group
`(?flags)`	set flags within current group; non-capturing
`(?flags:re)`	set flags during re; non-capturing
`(?#text)`	comment (NOT SUPPORTED)
`(?|x|y|z)`	branch numbering reset (NOT SUPPORTED)
`(?>re)`	possessive match of `re` (NOT SUPPORTED)
`re@>`	possessive match of `re` (NOT SUPPORTED) VIM
`%(re)`	non-capturing group (NOT SUPPORTED) VIM

Flags:
`i`	case-insensitive (default false)
`m`	multi-line mode: ^ and $ match begin/end line in addition to begin/end text (default false)
`s`	let `.` match `\n` (default false)
`U`	ungreedy: swap meaning of `x*` and `x*?`, `x+` and `x+?`, etc (default false)
Flag syntax is `xyz` (set) or `-xyz` (clear) or `xy-z` (set `xy`, clear `z`).

Empty strings:
`^`	at beginning of text or line (`m`=true)
`$`	at end of text (like `\z` not `\Z`) or line (`m`=true)
`\A`	at beginning of text
`\b`	at word boundary (`\w` on one side and `\W`, `\A`, or `\z` on the other)
`\B`	not a word boundary
`\G`	at beginning of subtext being searched (NOT SUPPORTED) PCRE
`\G`	at end of last match (NOT SUPPORTED) PERL
`\Z`	at end of text, or before newline at end of text (NOT SUPPORTED)
`\z`	at end of text
`(?=re)`	before text matching `re` (NOT SUPPORTED)
`(?!re)`	before text not matching `re` (NOT SUPPORTED)
`(?<=re)`	after text matching `re` (NOT SUPPORTED)
`(?<!re)`	after text not matching `re` (NOT SUPPORTED)
`re&`	before text matching `re` (NOT SUPPORTED) VIM
`re@=`	before text matching `re` (NOT SUPPORTED) VIM
`re@!`	before text not matching `re` (NOT SUPPORTED) VIM
`re@<=`	after text matching `re` (NOT SUPPORTED) VIM
`re@<!`	after text not matching `re` (NOT SUPPORTED) VIM
`\zs`	sets start of match (= \K) (NOT SUPPORTED) VIM
`\ze`	sets end of match (NOT SUPPORTED) VIM
`\%^`	beginning of file (NOT SUPPORTED) VIM
`\%$`	end of file (NOT SUPPORTED) VIM
`\%V`	on screen (NOT SUPPORTED) VIM
`\%#`	cursor position (NOT SUPPORTED) VIM
`\%'m`	mark `m` position (NOT SUPPORTED) VIM
`\%23l`	in line 23 (NOT SUPPORTED) VIM
`\%23c`	in column 23 (NOT SUPPORTED) VIM
`\%23v`	in virtual column 23 (NOT SUPPORTED) VIM

Escape sequences:
`\a`	bell (≡ `\007`)
`\f`	form feed (≡ `\014`)
`\t`	horizontal tab (≡ `\011`)
`\n`	newline (≡ `\012`)
`\r`	carriage return (≡ `\015`)
`\v`	vertical tab character (≡ `\013`)
`\*`	literal `*`, for any punctuation character `*`
`\123`	octal character code (up to three digits)
`\x7F`	hex character code (exactly two digits)
`\x{10FFFF}`	hex character code
`\C`	match a single byte even in UTF-8 mode
`\Q...\E`	literal text `...` even if `...` has punctuation

`\1`	backreference (NOT SUPPORTED)
`\b`	backspace (NOT SUPPORTED) (use `\010`)
`\cK`	control char ^K (NOT SUPPORTED) (use `\001` etc)
`\e`	escape (NOT SUPPORTED) (use `\033`)
`\g1`	backreference (NOT SUPPORTED)
`\g{1}`	backreference (NOT SUPPORTED)
`\g{+1}`	backreference (NOT SUPPORTED)
`\g{-1}`	backreference (NOT SUPPORTED)
`\g{name}`	named backreference (NOT SUPPORTED)
`\g<name>`	subroutine call (NOT SUPPORTED)
`\g'name'`	subroutine call (NOT SUPPORTED)
`\k<name>`	named backreference (NOT SUPPORTED)
`\k'name'`	named backreference (NOT SUPPORTED)
`\lX`	lowercase `X` (NOT SUPPORTED)
`\ux`	uppercase `x` (NOT SUPPORTED)
`\L...\E`	lowercase text `...` (NOT SUPPORTED)
`\K`	reset beginning of `$0` (NOT SUPPORTED)
`\N{name}`	named Unicode character (NOT SUPPORTED)
`\R`	line break (NOT SUPPORTED)
`\U...\E`	upper case text `...` (NOT SUPPORTED)
`\X`	extended Unicode sequence (NOT SUPPORTED)

`\%d123`	decimal character 123 (NOT SUPPORTED) VIM
`\%xFF`	hex character FF (NOT SUPPORTED) VIM
`\%o123`	octal character 123 (NOT SUPPORTED) VIM
`\%u1234`	Unicode character 0x1234 (NOT SUPPORTED) VIM
`\%U12345678`	Unicode character 0x12345678 (NOT SUPPORTED) VIM

Character class elements:
`x`	single character
`A-Z`	character range (inclusive)
`\d`	Perl character class
`[:foo:]`	ASCII character class `foo`
`\p{Foo}`	Unicode character class `Foo`
`\pF`	Unicode character class `F` (one-letter name)

Named character classes as character class elements:
`[\d]`	digits (≡ `\d`)
`[^\d]`	not digits (≡ `\D`)
`[\D]`	not digits (≡ `\D`)
`[^\D]`	not not digits (≡ `\d`)
`[[:name:]]`	named ASCII class inside character class (≡ `[:name:]`)
`[^[:name:]]`	named ASCII class inside negated character class (≡ `[:^name:]`)
`[\p{Name}]`	named Unicode property inside character class (≡ `\p{Name}`)
`[^\p{Name}]`	named Unicode property inside negated character class (≡ `\P{Name}`)

Perl character classes:
`\d`	digits (≡ `[0-9]`)
`\D`	not digits (≡ `[^0-9]`)
`\s`	whitespace (≡ `[\t\n\f\r ]`)
`\S`	not whitespace (≡ `[^\t\n\f\r ]`)
`\w`	word characters (≡ `[0-9A-Za-z_]`)
`\W`	not word characters (≡ `[^0-9A-Za-z_]`)

`\h`	horizontal space (NOT SUPPORTED)
`\H`	not horizontal space (NOT SUPPORTED)
`\v`	vertical space (NOT SUPPORTED)
`\V`	not vertical space (NOT SUPPORTED)

ASCII character classes:
`[:alnum:]`	alphanumeric (≡ `[0-9A-Za-z]`)
`[:alpha:]`	alphabetic (≡ `[A-Za-z]`)
`[:ascii:]`	ASCII (≡ `[\x00-\x7F]`)
`[:blank:]`	blank (≡ `[\t ]`)
`[:cntrl:]`	control (≡ `[\x00-\x1F\x7F]`)
`[:digit:]`	digits (≡ `[0-9]`)
`[:graph:]`	graphical (≡ `` [!-~] == [A-Za-z0-9!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~] ``)
`[:lower:]`	lower case (≡ `[a-z]`)
`[:print:]`	printable (≡ `[ -~] == [ [:graph:]]`)
`[:punct:]`	punctuation (≡ `` [!-/:-@[-`{-~] ``)
`[:space:]`	whitespace (≡ `[\t\n\v\f\r ]`)
`[:upper:]`	upper case (≡ `[A-Z]`)
`[:word:]`	word characters (≡ `[0-9A-Za-z_]`)
`[:xdigit:]`	hex digit (≡ `[0-9A-Fa-f]`)

Unicode character class names--general category:
`C`	other
`Cc`	control
`Cf`	format
`Cn`	unassigned code points (NOT SUPPORTED)
`Co`	private use
`Cs`	surrogate
`L`	letter
`LC`	cased letter (NOT SUPPORTED)
`L&`	cased letter (NOT SUPPORTED)
`Ll`	lowercase letter
`Lm`	modifier letter
`Lo`	other letter
`Lt`	titlecase letter
`Lu`	uppercase letter
`M`	mark
`Mc`	spacing mark
`Me`	enclosing mark
`Mn`	non-spacing mark
`N`	number
`Nd`	decimal number
`Nl`	letter number
`No`	other number
`P`	punctuation
`Pc`	connector punctuation
`Pd`	dash punctuation
`Pe`	close punctuation
`Pf`	final punctuation
`Pi`	initial punctuation
`Po`	other punctuation
`Ps`	open punctuation
`S`	symbol
`Sc`	currency symbol
`Sk`	modifier symbol
`Sm`	math symbol
`So`	other symbol
`Z`	separator
`Zl`	line separator
`Zp`	paragraph separator
`Zs`	space separator

Unicode character class names--scripts:
`Arabic`	Arabic
`Armenian`	Armenian
`Balinese`	Balinese
`Bengali`	Bengali
`Bopomofo`	Bopomofo
`Braille`	Braille
`Buginese`	Buginese
`Buhid`	Buhid
`Canadian_Aboriginal`	Canadian Aboriginal
`Carian`	Carian
`Cham`	Cham
`Cherokee`	Cherokee
`Common`	characters not specific to one script
`Coptic`	Coptic
`Cuneiform`	Cuneiform
`Cypriot`	Cypriot
`Cyrillic`	Cyrillic
`Deseret`	Deseret
`Devanagari`	Devanagari
`Ethiopic`	Ethiopic
`Georgian`	Georgian
`Glagolitic`	Glagolitic
`Gothic`	Gothic
`Greek`	Greek
`Gujarati`	Gujarati
`Gurmukhi`	Gurmukhi
`Han`	Han
`Hangul`	Hangul
`Hanunoo`	Hanunoo
`Hebrew`	Hebrew
`Hiragana`	Hiragana
`Inherited`	inherit script from previous character
`Kannada`	Kannada
`Katakana`	Katakana
`Kayah_Li`	Kayah Li
`Kharoshthi`	Kharoshthi
`Khmer`	Khmer
`Lao`	Lao
`Latin`	Latin
`Lepcha`	Lepcha
`Limbu`	Limbu
`Linear_B`	Linear B
`Lycian`	Lycian
`Lydian`	Lydian
`Malayalam`	Malayalam
`Mongolian`	Mongolian
`Myanmar`	Myanmar
`New_Tai_Lue`	New Tai Lue (aka Simplified Tai Lue)
`Nko`	Nko
`Ogham`	Ogham
`Ol_Chiki`	Ol Chiki
`Old_Italic`	Old Italic
`Old_Persian`	Old Persian
`Oriya`	Oriya
`Osmanya`	Osmanya
`Phags_Pa`	'Phags Pa
`Phoenician`	Phoenician
`Rejang`	Rejang
`Runic`	Runic
`Saurashtra`	Saurashtra
`Shavian`	Shavian
`Sinhala`	Sinhala
`Sundanese`	Sundanese
`Syloti_Nagri`	Syloti Nagri
`Syriac`	Syriac
`Tagalog`	Tagalog
`Tagbanwa`	Tagbanwa
`Tai_Le`	Tai Le
`Tamil`	Tamil
`Telugu`	Telugu
`Thaana`	Thaana
`Thai`	Thai
`Tibetan`	Tibetan
`Tifinagh`	Tifinagh
`Ugaritic`	Ugaritic
`Vai`	Vai
`Yi`	Yi

Vim character classes:
`\i`	identifier character (NOT SUPPORTED) VIM
`\I`	`\i` except digits (NOT SUPPORTED) VIM
`\k`	keyword character (NOT SUPPORTED) VIM
`\K`	`\k` except digits (NOT SUPPORTED) VIM
`\f`	file name character (NOT SUPPORTED) VIM
`\F`	`\f` except digits (NOT SUPPORTED) VIM
`\p`	printable character (NOT SUPPORTED) VIM
`\P`	`\p` except digits (NOT SUPPORTED) VIM
`\s`	whitespace character (≡ `[ \t]`) (NOT SUPPORTED) VIM
`\S`	non-white space character (≡ `[^ \t]`) (NOT SUPPORTED) VIM
`\d`	digits (≡ `[0-9]`) VIM
`\D`	not `\d` VIM
`\x`	hex digits (≡ `[0-9A-Fa-f]`) (NOT SUPPORTED) VIM
`\X`	not `\x` (NOT SUPPORTED) VIM
`\o`	octal digits (≡ `[0-7]`) (NOT SUPPORTED) VIM
`\O`	not `\o` (NOT SUPPORTED) VIM
`\w`	word character VIM
`\W`	not `\w` VIM
`\h`	head of word character (NOT SUPPORTED) VIM
`\H`	not `\h` (NOT SUPPORTED) VIM
`\a`	alphabetic (NOT SUPPORTED) VIM
`\A`	not `\a` (NOT SUPPORTED) VIM
`\l`	lowercase (NOT SUPPORTED) VIM
`\L`	not lowercase (NOT SUPPORTED) VIM
`\u`	uppercase (NOT SUPPORTED) VIM
`\U`	not uppercase (NOT SUPPORTED) VIM
`\_x`	`\x` plus newline, for any `x` (NOT SUPPORTED) VIM

Vim flags:
`\c`	ignore case (NOT SUPPORTED) VIM
`\C`	match case (NOT SUPPORTED) VIM
`\m`	magic (NOT SUPPORTED) VIM
`\M`	nomagic (NOT SUPPORTED) VIM
`\v`	verymagic (NOT SUPPORTED) VIM
`\V`	verynomagic (NOT SUPPORTED) VIM
`\Z`	ignore differences in Unicode combining characters (NOT SUPPORTED) VIM

Magic:
`(?{code})`	arbitrary Perl code (NOT SUPPORTED) PERL
`(??{code})`	postponed arbitrary Perl code (NOT SUPPORTED) PERL
`(?n)`	recursive call to regexp capturing group `n` (NOT SUPPORTED)
`(?+n)`	recursive call to relative group `+n` (NOT SUPPORTED)
`(?-n)`	recursive call to relative group `-n` (NOT SUPPORTED)
`(?C)`	PCRE callout (NOT SUPPORTED) PCRE
`(?R)`	recursive call to entire regexp (≡ `(?0)`) (NOT SUPPORTED)
`(?&name)`	recursive call to named group (NOT SUPPORTED)
`(?P=name)`	named backreference (NOT SUPPORTED)
`(?P>name)`	recursive call to named group (NOT SUPPORTED)
`(?(cond)true|false)`	conditional branch (NOT SUPPORTED)
`(?(cond)true)`	conditional branch (NOT SUPPORTED)
`(*ACCEPT)`	make regexps more like Prolog (NOT SUPPORTED)
`(*COMMIT)`	(NOT SUPPORTED)
`(*F)`	(NOT SUPPORTED)
`(*FAIL)`	(NOT SUPPORTED)
`(*MARK)`	(NOT SUPPORTED)
`(*PRUNE)`	(NOT SUPPORTED)
`(*SKIP)`	(NOT SUPPORTED)
`(*THEN)`	(NOT SUPPORTED)
`(*ANY)`	set newline convention (NOT SUPPORTED)
`(*ANYCRLF)`	(NOT SUPPORTED)
`(*CR)`	(NOT SUPPORTED)
`(*CRLF)`	(NOT SUPPORTED)
`(*LF)`	(NOT SUPPORTED)
`(*BSR_ANYCRLF)`	set \R convention (NOT SUPPORTED) PCRE
`(*BSR_UNICODE)`	(NOT SUPPORTED) PCRE

Further reading

"perlre - Perl regular expressions" http://perldoc.perl.org/perlre.html
"Implementing Regular Expressions" http://swtch.com/~rsc/regexp
The re1 project: http://code.google.com/p/re1
The re2 project: http://code.google.com/p/re2
sregex: A non-backtracking regex engine matching on data streams
sregex: matching Perl 5 regexes on data streams: http://agentzh.org/misc/slides/yapc-na-2013-sregex.pdf